I'm trying to use tryCatch. What I want is to run through a list of URLs that I have stored in page1URLs and, if there is a problem with one of them (using readHTMLTable()), keep a record of which ones failed and then have the code go on to the next URL without crashing.
I think I don't have the right idea here at all. Can anyone suggest how I can do this?
Here is the beginning of the code:
baddy <- rep(NA, 10000)
badURLs <- function(url) { baddy=c(baddy,url) }
writeURLsToCsvExtrema(38.361042, 35.465144, 141.410522, 139.564819)
writeURLsToCsvExtrema <- function(maxlat, minlat, maxlong, minlong) {
    urlsFuku <- page1URLs
    allFuku <- data.frame() # need to initialize it with column names
    for (url in urlsFuku) {
        tryCatch(temp.tables=readHTMLTable(url), finally=badURLs(url))
        temp.df <- temp.tables[[3]]
        lastrow <- nrow(temp.df)
        temp.df <- temp.df[-c(lastrow-1,lastrow),]
    }
One general approach is to write a function that fully processes one URL, returning either the computed value or NULL to indicate failure:
FUN = function(url) {
    tryCatch({
        xx <- readHTMLTable(url) ## will sometimes fail, invoking 'error' below
        ## more calculations
        xx ## final value
    }, error=function(err) {
        ## what to do on error? could return conditionMessage(err) or other...
        NULL
    })
}
and then use this, e.g., with a named vector:
urls <- c("http://cran.r-project.org", "http://stackoverflow.com",
"http://foo.bar")
names(urls) <- urls # add names to urls, so 'result' elements are named
result <- lapply(urls, FUN)
These guys failed (returned NULL)
> names(result)[sapply(result, is.null)]
[1] "http://foo.bar"
And these are the results for further processing
final <- Filter(Negate(is.null), result)
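To also get a record of which URLs failed and why, as the original question asks, one variation (a sketch only; FUN2 is an illustrative name, not from the answer above) is to return the error message instead of NULL:

FUN2 = function(url) {
    tryCatch({
        readHTMLTable(url)
    }, error=function(err) {
        ## record the failure reason instead of returning NULL
        conditionMessage(err)
    })
}
result2 <- lapply(urls, FUN2)
## successes are lists of tables; failures are character strings
failed <- result2[sapply(result2, is.character)]

The names of failed are the bad URLs, and its values are the corresponding error messages.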
Related
I am writing code in R to read a list of CSV files from 6000 URLs.
data <- read.csv(url)
If R cannot access a URL, the code execution stops. Does anyone know how to keep this error from stopping execution in R?
I have been looking for an argument to the read.csv function for this, but perhaps there is a separate function.
Simply use tryCatch to catch the error inside the loop but continue on to other iterations:
# DEFINED METHOD, ON ERROR PRINTS MESSAGE AND RETURNS NULL
read_data_from_url <- function(url) {
    tryCatch({
        read.csv(url)
    }, error = function(e) {
        print(e)
        return(NULL)
    })
}
# NAMED LIST OF DATA FRAMES
df_list <- sapply(
list_of_6000_urls, read_data_from_url, simplify = FALSE
)
# FILTER OUT NULLS (PROBLEMATIC URLS)
df_list <- Filter(NROW, df_list)
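If you also want an explicit record of which URLs were problematic (an extra step, not part of the original answer; this assumes list_of_6000_urls is a character vector, so that sapply() names the results after it):

# URLS THAT FAILED = THOSE DROPPED BY THE FILTER ABOVE
bad_urls <- setdiff(list_of_6000_urls, names(df_list))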
I have a list of URLs (more than 4000) from a specific domain (pixilink.com) and what I want to do is figure out whether each provided link is a picture or a video. To do this, I used the solutions provided here: How to write trycatch in R and Check whether a website provides photo or video based on a pattern in its URL, and wrote the code shown below:
#Function to get the value of initial_mode from the URL
urlmode <- function(x){
    mycontent <- readLines(x)
    mypos <- grep("initial_mode = ", mycontent)
    if(grepl("0", mycontent[mypos])){
        return("picture")
    } else if(grepl("tour", mycontent[mypos])){
        return("video")
    } else{
        return(NA)
    }
}
Also, in order to avoid errors for URLs that don't exist, I used the code below:
readUrl <- function(url) {
    out <- tryCatch(
        {
            readLines(con=url, warn=FALSE)
            return(1)
        },
        error=function(cond) {
            return(NA)
        },
        warning=function(cond) {
            return(NA)
        },
        finally={
            message(url)
        }
    )
    return(out)
}
Finally, I split the list of URLs and passed it to the functions described above (here, for instance, I used the first 1000 values from the URL list):
a <- subset(new_df, new_df$host=="www.pixilink.com")
vec <- a[['V']]
vec <- vec[1:1000] # only chose first 1000 rows
tt <- numeric(length(vec)) # checking validity of url
for (i in 1:length(vec)){
    tt[i] <- readUrl(vec[i])
    print(i)
}
g <- data.frame(vec,tt)
g2 <- g[which(!is.na(g$tt)),] #only valid url
dd <- numeric(nrow(g2))
for (j in 1:nrow(g2)){
    dd[j] <- urlmode(g2[j,1])
}
Final <- cbind(g2,dd)
Final <- left_join(g, Final, by = c("vec" = "vec"))
I ran this code on a sample list of 100 URLs and it worked; however, after I ran it on the whole list of URLs, it returned an error: Error in textConnection("rval", "w", local = TRUE) : all connections are in use
And after this, even for the sample URLs (the 100 samples I tested before), running the code gave this error message: Error in file(con, "r") : all connections are in use
I also tried closeAllConnections() after each call to these functions in the loop, but it didn't work.
Can anyone explain what this error is about? Is it related to the number of requests we can make to the website? What's the solution?
So, my guess as to why this is happening is that you're not closing the connections that you're opening via tryCatch() and via urlmode() through the use of readLines(). I was unsure of how urlmode() was going to be used in your previous post, so I had made it as simple as I could (and in hindsight, that was badly done, my apologies). So I took the liberty of rewriting urlmode() to try and make it a little bit more robust for what appears to be a more expansive task at hand.
I think the comments in the code should help, so take a look below:
#Updated URL mode function with better
#URL checking, connection handling,
#and "mode" investigation
urlmode <- function(x){
    #Check if URL is good to go
    if(!httr::http_error(x)){
        #Test cases
        #x <- "www.pixilink.com/3"
        #x <- "https://www.pixilink.com/93320"
        #x <- "https://www.pixilink.com/93313"
        #Then since there are redirect shenanigans
        #Get the actual URL the input points to
        #It should just be the input URL if there is
        #no redirection
        #This is important as this also takes care of
        #checking whether http or https need to be prefixed
        #in case the input URL is supplied without those
        #(this can cause problems for url() below)
        myx <- httr::HEAD(x)$url
        #Then check for what the default mode is
        mycon <- url(myx)
        open(mycon, "r")
        mycontent <- readLines(mycon)
        mypos <- grep("initial_mode = ", mycontent)
        #Close the connection since it's no longer
        #necessary
        close(mycon)
        #Some URLs with weird formats can return
        #empty on this one since they don't
        #follow the expected format.
        #See for example: "https://www.pixilink.com/clients/899/#3"
        #which is actually
        #redirected from "https://www.pixilink.com/3"
        #After that, evaluate what's at mypos, and always
        #return the actual URL
        #along with the result
        if(!purrr::is_empty(mypos)){
            #Extract the value assigned to initial_mode
            #(the string between the quotes)
            mystr <- stringr::str_extract(mycontent[mypos], "(?<=\').*(?=\')")
            return(c(myx, mystr))
        } else{
            #Valid URL but not interpretable
            return(c(myx, "uninterpretable"))
        }
    } else{
        #Straight up invalid URL
        #No myx variable to return here
        #Just x
        return(c(x, "invalid"))
    }
}
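As a quick standalone sanity check (my addition, using one of the test-case URLs from the comments above):

#Quick standalone check
urlmode("https://www.pixilink.com/93313")
#Expected: the resolved URL plus the raw initial_mode value
#(e.g. "0", which the str_replace() step further below maps to "picture")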
#--------
#Sample code execution
library(purrr)
library(parallel)
library(future.apply)
library(httr)
library(stringr)
library(progressr)
library(progress)
#All future + progressr related stuff
#learned courtesy
#https://stackoverflow.com/a/62946400/9494044
#Setting up parallelized execution
no_cores <- parallel::detectCores()
#The above setup will ensure ALL cores
#are put to use
clust <- parallel::makeCluster(no_cores)
future::plan(cluster, workers = clust)
#Progress bar for sanity checking
progressr::handlers(progressr::handler_progress(format="[:bar] :percent :eta :message"))
#Website's base URL
baseurl <- "https://www.pixilink.com"
#Using future_lapply() to apply urlmode()
#to a sequence of the URLs on pixilink in parallel
#and storing the results in sitetype
#Using a future chunk size of 10
#Everything is wrapped in with_progress() to enable the
#progress bar
#
range <- 93310:93350
#range <- 1:10000
progressr::with_progress({
    myprog <- progressr::progressor(along = range)
    sitetype <- do.call(rbind, future_lapply(range, function(b, x){
        myprog() ##Progress bar signaller
        myurl <- paste0(b, "/", x)
        cat("\n", myurl, " ")
        myret <- urlmode(myurl)
        cat(myret, "\n")
        return(c(myurl, myret))
    }, b = baseurl, future.chunk.size = 10))
})
#Converting into a proper data.frame
#and assigning column names
sitetype <- data.frame(sitetype)
names(sitetype) <- c("given_url", "actual_url", "mode")
#A bit of wrangling to tidy up the mode column
sitetype$mode <- stringr::str_replace(sitetype$mode, "0", "picture")
head(sitetype)
# given_url actual_url mode
# 1 https://www.pixilink.com/93310 https://www.pixilink.com/93310 invalid
# 2 https://www.pixilink.com/93311 https://www.pixilink.com/93311 invalid
# 3 https://www.pixilink.com/93312 https://www.pixilink.com/93312 floorplan2d
# 4 https://www.pixilink.com/93313 https://www.pixilink.com/93313 picture
# 5 https://www.pixilink.com/93314 https://www.pixilink.com/93314 floorplan2d
# 6 https://www.pixilink.com/93315 https://www.pixilink.com/93315 tour
unique(sitetype$mode)
# [1] "invalid" "floorplan2d" "picture" "tour"
#--------
Basically, urlmode() now opens and closes connections only when necessary, checks for URL validity and URL redirection, and also "intelligently" extracts the value assigned to initial_mode. With the help of future_lapply() (from the future.apply package) and the progress bar from the progressr package, this can now be applied quite conveniently in parallel to as many pixilink.com/<integer> URLs as desired. With a bit of wrangling thereafter, the results can be presented very tidily as a data.frame, as shown.
As an example, I've demonstrated this for a small range in the code above. Note the commented-out 1:10000 range in the code in this context: I let this code run for a couple of hours over this (hopefully sufficiently) large range of URLs to check for errors and problems. I can attest that I encountered no errors (only the regular warnings In readLines(mycon) : incomplete final line found on 'https://www.pixilink.com/93334'). For proof, I have the data from all 10000 URLs written to a CSV file that I can provide upon request (I don't fancy uploading that to pastebin or elsewhere unnecessarily). Due to an oversight on my part, I forgot to benchmark that run, but I suppose I could do that later if performance metrics are desired or would be considered interesting.
For your purposes, I believe you can simply take the entire code snippet above and run it verbatim (or with modifications) by just changing the range assignment right before the with_progress(do.call(...)) step to a range of your liking. I believe this approach is simpler and does away with having to deal with multiple functions and such (and no tryCatch() messes to deal with).
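One housekeeping addition of my own, not part of the original answer: the cluster created with makeCluster() above is never shut down, and lingering workers hold open connections of their own. Once the run is finished, it is worth releasing them:

#Release the cluster workers and revert
#to sequential execution when done
parallel::stopCluster(clust)
future::plan(future::sequential)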
I'm very new to R and I would like to write a loop that returns the search volume (through an API call) for a list of keywords.
Here is the code that I used:
install.packages("SEMrushR")
library(SEMrushR)
mes_keywords_to_check <- readLines("voyage.txt") # List of keywords to check
mes_keywords_to_check <- as.character(mes_keywords_to_check)
# Loop
for (i in 1:length(mes_keywords_to_check)) {
    test_keyword <- as.character(mes_keywords_to_check[i])
    df_test_2 <- keyword_overview_all(test_keyword, "fr", "API KEY NUMBER") ## keyword_overview_all is the function from the SEMrush package
}
By doing this, I only get the search volume for the first keyword in the list. My purpose is of course to get the data required for the full list of keywords.
Here is the table that I get: [screenshot of the result table omitted]
Do you have any idea how I could solve this issue?
Well, you need to add your results to some kind of container, for example a list. As of now, you have just one object that gets overwritten with data from the most recent iteration of your loop.
results <- list()
for (i in 1:length(mes_keywords_to_check)) {
    test_keyword <- as.character(mes_keywords_to_check[i])
    df_test_2 <- keyword_overview_all(test_keyword, "fr", "API KEY NUMBER")
    results[[i]] <- df_test_2
}
But most R experts would suggest refraining from using a loop:
library("plyr")
result <- plyr::ldply(mes_keywords_to_check, function(x) keyword_overview_all(as.character(x), "fr","API KEY NUMBER"))
I did not test this, and it probably needs some tweaking, but it should point you in the right direction.
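If you prefer to avoid the plyr dependency, a base-R equivalent (equally untested, same caveat) collects the per-keyword data frames with lapply() and binds them once at the end:

result_list <- lapply(mes_keywords_to_check, function(x) {
    keyword_overview_all(as.character(x), "fr", "API KEY NUMBER")
})
result <- do.call(rbind, result_list)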
It looks like you're reading in the text file with readLines("voyage.txt"), which returns a vector with one element per line. These lines are then passed to the for loop. The code below converts the lines to words. There are various approaches, but the one below uses a loop within a loop to keep using for(), in case you prefer to search line-by-line, word-by-word. It also uses a regex to split on non-alphanumeric characters, so that punctuation does not end up attached to words.
mes_lines <- readLines("voyage.txt") # Lines of the file to check
mes_lines <- as.character(mes_lines)
search_results <- list()
for (i in 1:length(mes_lines)) {
    mes_keywords_to_check <- unlist(strsplit(mes_lines[i], "[^[:alnum:]]"))
    mes_keywords_to_check <- mes_keywords_to_check[nchar(mes_keywords_to_check) > 0]
    if (length(mes_keywords_to_check) == 0) next
    for (w in 1:length(mes_keywords_to_check)) {
        test_keyword <- as.character(mes_keywords_to_check[w])
        print(paste0("Checking word=", test_keyword))
        df_test_2 <- keyword_overview_all(test_keyword, "fr", "API KEY NUMBER")
        # wrap in list() so each data frame is appended as a single element
        search_results <- append(search_results, list(df_test_2))
    }
}
search_results
Thanks for pointing me in the right direction.
Here is what I did, and this is working:
final_result <- data.frame()
mes_keywords_to_check <- readLines("voyage.txt")
mes_keywords_to_check <- as.character(mes_keywords_to_check)
for (i in 1:length(mes_keywords_to_check)) {
    test_keyword <- as.character(mes_keywords_to_check[i])
    df_test_2 <- keyword_overview_all(test_keyword, "fr", "API KEY")
    final_result <- rbind(final_result, df_test_2)
}
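One caveat on this (my note, not from the thread): rbind() inside the loop copies the growing data frame on every iteration, which gets slow for long keyword lists. The usual pattern is to collect into a pre-allocated list and bind once at the end:

results <- vector("list", length(mes_keywords_to_check))
for (i in seq_along(mes_keywords_to_check)) {
    results[[i]] <- keyword_overview_all(mes_keywords_to_check[i], "fr", "API KEY")
}
final_result <- do.call(rbind, results)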
readStateData <- function() {
    infile <- paste("state",i,".txt",sep="")
    state <- readLines(infile,n=1)
    statedata <- read.table(infile,header=FALSE,sep=",",skip=1,col.names=c("Rank","City","Population"))
    statename <- list(state,statedata)
    statename
}
# Start loop
for(i in 1:50) {
    readStateData()
    # Add function to big.list
    big.list[[i]] <- readStateData(statename)
}
The assignment for class is to bring in 50 files, all named state#.txt, get the state via readLines, get the data via read.table, and ultimately put it all into a big.list that will hold all of the data, through a for loop.
The problem I'm having is calling the function in during the for loop. I get the error:
Error in readStateData(statename) : unused argument (statename)
I'm either not calling in the function properly or I've written the function wrong. Both are likely.
Thank you for your help.
You have a few different issues here.
First, do not refer inside a function to a variable that is defined outside it. That is, instead of accessing an i defined outside the function from inside it:
i <- 1
fct <- function() {
    a <- i + 1
    return(a)
}
fct()
Pass the variable as an argument to the function:
i <- 1
fct <- function(x) {
    a <- x + 1
    return(a)
}
fct(i)
Second, your function is missing a return statement (see the last command in the functions in point 1). Without a return statement, the value of the last evaluated expression is what the function "returns"; this works, but it is not the clean way to return a value.
Ergo your code should look like this:
readStateData <- function(x) {
    infile <- paste("state",x,".txt",sep="")
    state <- readLines(infile,n=1)
    statedata <- read.table(infile,header=FALSE,sep=",",skip=1,col.names=c("Rank","City","Population"))
    statename <- list(state,statedata)
    return(statename)
}
# Initialize the result container
big.list <- list()
# Start loop
for(i in 1:50) {
    j <- readStateData(i)
    # Add result to big.list
    big.list[[i]] <- j
}
If your files all follow the pattern state[number].txt, you can simplify your code to:
# Get all files with pattern state*.txt
fls <- dir(pattern='state.*txt')
readStateData <- function(x) {
    state <- readLines(x, n=1)
    statedata <- read.table(x, header=FALSE,sep=",",skip=1,col.names=c("Rank","City","Population"))
    statename <- list(state,statedata)
    return(statename)
}
# Initialize the result container
big.list <- list()
# Start loop
for(i in 1:length(fls)) {
    j <- readStateData(fls[i])
    # Add result to big.list
    big.list[[i]] <- j
}
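Since readStateData() is now a pure function of the file name, the loop can also be collapsed into a single lapply() call (same behavior, using the fls vector from above):

# Equivalent to the loop above
big.list <- lapply(fls, readStateData)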
I'm running this function:
require(XML)
require(plyr)
getKeyStats_xpath <- function(symbol) {
    yahoo.URL <- "http://finance.yahoo.com/q/ks?s="
    html_text <- htmlParse(paste(yahoo.URL, symbol, sep = ""), encoding="UTF-8")
    #search for <td> nodes anywhere that have class 'yfnc_tablehead1'
    nodes <- getNodeSet(html_text, "/*//td[@class='yfnc_tablehead1']")
    if(length(nodes) > 0) {
        measures <- sapply(nodes, xmlValue)
        #Clean up the column name
        measures <- gsub(" *[0-9]*:", "", gsub(" \\(.*?\\)[0-9]*:","", measures))
        #Remove dups
        dups <- which(duplicated(measures))
        #print(dups)
        for(i in 1:length(dups))
            measures[dups[i]] = paste(measures[dups[i]], i, sep=" ")
        #use siblings function to get value
        values <- sapply(nodes, function(x) xmlValue(getSibling(x)))
        df <- data.frame(t(values))
        colnames(df) <- measures
        return(df)
    } else {
        break
    }
}
As long as the page exists, it works fine. However, if one of my tickers does NOT have any data on that URL, it throws an error:
Error in FUN(X[[3L]], ...) : no loop for break/next, jumping to top level
I added a trace too, and things break down on ticker number 3.
tickers <- c("QLTI",
"RARE",
"RCPT",
"RDUS",
"REGN",
"RGEN",
"RGLS")
tryCatch({
    stats <- ldply(tickers, getKeyStats_xpath)
}, finally={})
I'd like to call the function like this:
stats <- ldply(tickers, getKeyStats_xpath)
rownames(stats) <- tickers
write.csv(t(stats), "FinancialStats_updated.csv",row.names=TRUE)
Basically, if a ticker has no data, I want to skip it.
Can someone please help me get this working?
Expanding on my comment. The issue here is you've enclosed the entire command stats <- ldply(tickers, getKeyStats_xpath) within a tryCatch. This means R will try to get key stats from every ticker.
Instead, what you want is to try each ticker.
To do this, write a wrapper for getKeyStats_xpath that encloses it in tryCatch. You could do this within ldply with an anonymous function, for example ldply(tickers, function(t) tryCatch(getKeyStats_xpath(t), finally={})). Note that finally executes regardless of exit condition, so finally={} executes nothing. (See Advanced R or How to write try catch in R from r-faq for more.)
On an error, tryCatch calls the function provided in the argument error. So as is, this code still won't help, as the error is unhandled (thanks to rawr for pointing this out earlier). It is also easier to inspect the output if you use llply instead.
So a complete answer using this approach, and with informative error handling, is below.
stats <- llply(tickers,
    function(t) tryCatch(getKeyStats_xpath(t),
        error=function(x) {
            cat("error occurred for:\n", t, "\n...skipping this ticker\n")
        }
    )
)
names(stats) <- tickers
lapply(stats, length)
#<snip>
#$RCPT
#[1] 0
# </snip>
As of now, this works for me, returning data for all tickers except the one listed in the code block above.
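To get from that list to the CSV output the question wants (a sketch of mine; plyr::rbind.fill() is used because different tickers may not report identical measures), drop the failed tickers and bind the rest:

# keep only tickers that returned a data.frame
ok <- Filter(is.data.frame, stats)
# rbind.fill() tolerates differing column sets across tickers
stats_df <- plyr::rbind.fill(ok)
rownames(stats_df) <- names(ok)
write.csv(t(stats_df), "FinancialStats_updated.csv", row.names=TRUE)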