Using tryCatch within plyr

I'm running this function:
require(XML)
require(plyr)
getKeyStats_xpath <- function(symbol) {
  yahoo.URL <- "http://finance.yahoo.com/q/ks?s="
  html_text <- htmlParse(paste(yahoo.URL, symbol, sep = ""), encoding = "UTF-8")
  #search for <td> nodes anywhere that have class 'yfnc_tablehead1'
  nodes <- getNodeSet(html_text, "/*//td[@class='yfnc_tablehead1']")
  if (length(nodes) > 0) {
    measures <- sapply(nodes, xmlValue)
    #Clean up the column name
    measures <- gsub(" *[0-9]*:", "", gsub(" \\(.*?\\)[0-9]*:", "", measures))
    #Remove dups
    dups <- which(duplicated(measures))
    #print(dups)
    for (i in 1:length(dups))
      measures[dups[i]] = paste(measures[dups[i]], i, sep = " ")
    #use siblings function to get value
    values <- sapply(nodes, function(x) xmlValue(getSibling(x)))
    df <- data.frame(t(values))
    colnames(df) <- measures
    return(df)
  } else {
    break
  }
}
As long as the page exists, it works fine. However, if one of my tickers does NOT have any data on that URL, it throws an error:
Error in FUN(X[[3L]], ...) : no loop for break/next, jumping to top level
I added a trace too, and things break down on ticker number 3.
tickers <- c("QLTI",
"RARE",
"RCPT",
"RDUS",
"REGN",
"RGEN",
"RGLS")
tryCatch({
stats <- ldply(tickers, getKeyStats_xpath)
}, finally={})
I'd like to call the function like this:
stats <- ldply(tickers, getKeyStats_xpath)
rownames(stats) <- tickers
write.csv(t(stats), "FinancialStats_updated.csv",row.names=TRUE)
Basically, if a ticker has no data, I want to skip it.
Can someone please help me get this working?

Expanding on my comment. The issue here is that you've enclosed the entire command stats <- ldply(tickers, getKeyStats_xpath) within a single tryCatch, so the first ticker that errors aborts the whole ldply() call.
Instead, what you want is to try each ticker individually.
To do this, write a wrapper for getKeyStats_xpath that encloses it in tryCatch. You could do this within ldply with an anonymous function, for example ldply(tickers, function(t) tryCatch(getKeyStats_xpath(t), finally = {})). Note that finally executes regardless of exit condition, so finally = {} executes nothing. (See Advanced R or How to write trycatch in R from r-faq for more.)
On an error, tryCatch calls the function provided in the error argument. So as is, this code still won't help, because the error is unhandled (thanks to rawr for pointing this out earlier). The output is also easier to inspect if you use llply instead.
A complete answer using this approach, with informative error handling, is below.
stats <- llply(tickers,
  function(t) tryCatch(getKeyStats_xpath(t),
    error = function(x) {
      cat("error occurred for:\n", t, "\n...skipping this ticker\n")
    }
  )
)
names(stats) <- tickers
lapply(stats, length)
#<snip>
#$RCPT
#[1] 0
#</snip>
As of now, this works for me, returning data for all tickers except the one listed in the code block above.
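Since cat() returns NULL, the failed tickers end up as NULL (length-0) elements of stats. To get from there to the CSV output the question asked for, one option is the sketch below (ok and flat are illustrative names; rbind.fill() is from plyr, and this assumes each successful element is a one-row data frame):
#Drop the NULL entries left by failed tickers, then
#row-bind the surviving one-row data frames into one data.frame
ok <- Filter(Negate(is.null), stats)
flat <- rbind.fill(ok)
rownames(flat) <- names(ok)
write.csv(t(flat), "FinancialStats_updated.csv", row.names = TRUE)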

Related

how to properly close connection so I won't get "Error in file(con, "r") : all connections are in use" when using "readlines" and "tryCatch"

I have a list of URLs (more than 4000) from a specific domain (pixilink.com), and I want to figure out whether each URL serves a picture or a video. To do this, I used the solutions provided in How to write trycatch in R and Check whether a website provides photo or video based on a pattern in its URL, and wrote the code shown below:
#Function to get the value of initial_mode from the URL
urlmode <- function(x){
  mycontent <- readLines(x)
  mypos <- grep("initial_mode = ", mycontent)
  if(grepl("0", mycontent[mypos])){
    return("picture")
  } else if(grepl("tour", mycontent[mypos])){
    return("video")
  } else{
    return(NA)
  }
}
Also, to avoid errors for URLs that don't exist, I used the code below:
readUrl <- function(url) {
  out <- tryCatch(
    {
      readLines(con = url, warn = FALSE)
      return(1)
    },
    error = function(cond) {
      return(NA)
    },
    warning = function(cond) {
      return(NA)
    },
    finally = {
      message(url)
    }
  )
  return(out)
}
Finally, I split the list of URLs and passed it into the functions described above (here, for instance, I used the first 1000 values of the URL list):
a <- subset(new_df, new_df$host == "www.pixilink.com")
vec <- a[['V']]
vec <- vec[1:1000] # only chose first 1000 rows
tt <- numeric(length(vec)) # checking validity of url
for (i in 1:length(vec)){
  tt[i] <- readUrl(vec[i])
  print(i)
}
g <- data.frame(vec, tt)
g2 <- g[which(!is.na(g$tt)),] #only valid url
dd <- numeric(nrow(g2))
for (j in 1:nrow(g2)){
  dd[j] <- urlmode(g2[j,1])
}
Final <- cbind(g2, dd)
Final <- left_join(g, Final, by = c("vec" = "vec"))
I ran this code on a sample list of 100 URLs and it worked; however, after I ran it on the whole list of URLs, it returned an error: Error in textConnection("rval", "w", local = TRUE) : all connections are in use
After this, even for the sample URLs (the 100 samples I tested before), running the code gave: Error in file(con, "r") : all connections are in use
I also tried closeAllConnections() after each function call in the loop, but it didn't work.
Can anyone explain what this error is about? Is it related to the number of requests the website allows? What's the solution?
My guess as to why this is happening is that you're not closing the connections you open via tryCatch() and via urlmode() through the use of readLines(). I was unsure of how urlmode() was going to be used in your previous post, so I had made it as simple as I could (and in hindsight, that was badly done, my apologies). So I took the liberty of rewriting urlmode() to try to make it a bit more robust for what appears to be a more expansive task at hand.
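Before the full rewrite, the core of the fix is simply disciplined connection handling: create the connection, and guarantee it gets destroyed no matter how the function exits. A minimal sketch of that pattern (safeReadLines is a hypothetical helper for illustration, not part of the code below):
#Sketch: every url() connection is destroyed via on.exit(),
#even if open() or readLines() errors, so none can leak
safeReadLines <- function(u) {
  con <- url(u)
  on.exit(close(con))
  tryCatch({
    open(con, "r")
    readLines(con, warn = FALSE)
  }, error = function(e) NA)
}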
I think the comments in the code should help, so take a look below:
#Updated URL mode function with better
#URL checking, connection handling,
#and "mode" investigation
urlmode <- function(x){
  #Check if URL is good to go
  if(!httr::http_error(x)){
    #Test cases
    #x <- "www.pixilink.com/3"
    #x <- "https://www.pixilink.com/93320"
    #x <- "https://www.pixilink.com/93313"

    #Then since there are redirect shenanigans
    #Get the actual URL the input points to
    #It should just be the input URL if there is
    #no redirection
    #This is important as this also takes care of
    #checking whether http or https need to be prefixed
    #in case the input URL is supplied without those
    #(this can cause problems for url() below)
    myx <- httr::HEAD(x)$url

    #Then check for what the default mode is
    mycon <- url(myx)
    open(mycon, "r")
    mycontent <- readLines(mycon)
    mypos <- grep("initial_mode = ", mycontent)

    #Close the connection since it's no longer
    #necessary
    close(mycon)

    #Some URLs with weird formats can return
    #empty on this one since they don't
    #follow the expected format.
    #See for example: "https://www.pixilink.com/clients/899/#3"
    #which is actually
    #redirected from "https://www.pixilink.com/3"

    #After that, evaluate what's at mypos, and always
    #return the actual URL
    #along with the result
    if(!purrr::is_empty(mypos)){
      #mystr <- stringr::str_extract(mycontent[mypos], "(?<=initial_mode\\s\\=).*")
      mystr <- stringr::str_extract(mycontent[mypos], "(?<=\').*(?=\')")
      return(c(myx, mystr))
      #return(mystr)

      #So once all that is done, check if the line at mypos
      #contains a 0 (picture), tour (video)
      #if(grepl("0", mycontent[mypos])){
      #  return(c(myx, "picture"))
      #  #return("picture")
      #} else if(grepl("tour", mycontent[mypos])){
      #  return(c(myx, "video"))
      #  #return("video")
      #}
    } else{
      #Valid URL but not interpretable
      return(c(myx, "uninterpretable"))
      #return("uninterpretable")
    }
  } else{
    #Straight up invalid URL
    #No myx variable to return here
    #Just x
    return(c(x, "invalid"))
    #return("invalid")
  }
}
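A quick way to sanity-check the function on one of its own commented test cases before launching the parallel run:
#Returns c(actual_url, extracted_mode); note the mode is still the
#raw string from the page (e.g. "0" for picture pages) at this stage,
#before the str_replace() tidying shown further below
urlmode("https://www.pixilink.com/93313")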
#--------
#Sample code execution
library(purrr)
library(parallel)
library(future.apply)
library(httr)
library(stringr)
library(progressr)
library(progress)
#All future + progressr related stuff
#learned courtesy
#https://stackoverflow.com/a/62946400/9494044
#Setting up parallelized execution
no_cores <- parallel::detectCores()
#The above setup will ensure ALL cores
#are put to use
clust <- parallel::makeCluster(no_cores)
future::plan(cluster, workers = clust)
#Progress bar for sanity checking
progressr::handlers(progressr::handler_progress(format="[:bar] :percent :eta :message"))
#Website's base URL
baseurl <- "https://www.pixilink.com"
#Using future_lapply() to apply urlmode() in parallel
#to a sequence of the URLs on pixilink
#and storing the results in sitetype
#Using a future chunk size of 10
#Everything is wrapped in with_progress() to enable the
#progress bar
#
range <- 93310:93350
#range <- 1:10000
progressr::with_progress({
  myprog <- progressr::progressor(along = range)
  sitetype <- do.call(rbind, future_lapply(range, function(b, x){
    myprog() ##Progress bar signaller
    myurl <- paste0(b, "/", x)
    cat("\n", myurl, " ")
    myret <- urlmode(myurl)
    cat(myret, "\n")
    return(c(myurl, myret))
  }, b = baseurl, future.chunk.size = 10))
})
#Converting into a proper data.frame
#and assigning column names
sitetype <- data.frame(sitetype)
names(sitetype) <- c("given_url", "actual_url", "mode")
#A bit of wrangling to tidy up the mode column
sitetype$mode <- stringr::str_replace(sitetype$mode, "0", "picture")
head(sitetype)
# given_url actual_url mode
# 1 https://www.pixilink.com/93310 https://www.pixilink.com/93310 invalid
# 2 https://www.pixilink.com/93311 https://www.pixilink.com/93311 invalid
# 3 https://www.pixilink.com/93312 https://www.pixilink.com/93312 floorplan2d
# 4 https://www.pixilink.com/93313 https://www.pixilink.com/93313 picture
# 5 https://www.pixilink.com/93314 https://www.pixilink.com/93314 floorplan2d
# 6 https://www.pixilink.com/93315 https://www.pixilink.com/93315 tour
unique(sitetype$mode)
# [1] "invalid" "floorplan2d" "picture" "tour"
#--------
Basically, urlmode() now opens and closes connections only when necessary, checks for URL validity and URL redirection, and also "intelligently" extracts the value assigned to initial_mode. With the help of future_lapply() and the progress bar from the progressr package, this can now be applied quite conveniently in parallel to as many pixilink.com/<integer> URLs as desired. With a bit of wrangling thereafter, the results can be presented very tidily as a data.frame, as shown.
As an example, I've demonstrated this for a small range in the code above. Note the commented-out 1:10000 range: I let this code run for a couple of hours over that (hopefully sufficiently) large range of URLs to check for errors and problems, and I can attest that I encountered none (only the regular warnings In readLines(mycon) : incomplete final line found on 'https://www.pixilink.com/93334'). For proof, I have the data from all 10000 URLs written to a CSV file that I can provide upon request (I don't fancy uploading it to pastebin or elsewhere unnecessarily). Due to an oversight on my part, I forgot to benchmark that run, but I suppose I could do that later if performance metrics are desired or would be considered interesting.
For your purposes, I believe you can simply take the entire code snippet above and run it verbatim (or with modifications) by just changing the range assignment right before the with_progress() step to a range of your liking. I believe this approach is simpler and does away with having to deal with multiple functions and such (and no tryCatch() messes to deal with).
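One small addition worth making once the run is finished (not in the answer above, but standard cleanup for an explicitly created cluster): stop it so the worker processes and their connections are released:
#Release the worker processes created by makeCluster()
parallel::stopCluster(clust)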

quantmod getsymbols error handling

I am trying to cycle through a list of symbols, download the data, and save it to a CSV file. Individual stocks work perfectly fine, but the loop stops if there is an error, and I do not know how to handle errors (I'm new to R). I used part of an answer from the question below, but I am unable to find an answer on handling errors while looping and saving to file.
quantmod omitting tickers in getSymbols
startDate = Sys.Date() - 365
pth = "C:\\"
tickers <- c("LMT", "AAPL", "AMT", "GOOG")
#the sapply method works by not stopping when it has issues with LMT,
#and still goes on to download AAPL
library(quantmod)
WoW <- new.env()
sapply(tickers, function(x){
  try(
    getSymbols(
      x,
      src = "google",
      from = startDate,
      env = WoW),
    silent = TRUE)
})
#Now for the looping to save to file; somehow it does not go all the way to GOOG.
#It stops at AAPL with:
#Error in data.frame(sym) : row names contain missing values.
for (i in 1:length(tickers)) {
  col <- c("Open", "High", "Low", "Close", "Volume")
  sym <- eval(parse(text = paste("WoW$", tickers[i], sep = "")))
  if (!is.null(nrow(sym))) {
    colnames(sym) <- col
    sym <- data.frame(sym)
    sym <- cbind(BizDay = 0, sym)
    sym$BizDay <- rownames(sym)
    op <- paste0(pth, tickers[i], ".csv")
    print(op)
    write.table(sym, file = op, na = "", sep = ",", row.names = FALSE)
  }
}
Any pointers on how to handle basic errors? I have to run through the full security list and make sure I handle those errors, but right now I'm stuck on this.
Thanks
Got it to work with a nrow(sym) > 1 check.
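For context, here is a sketch of how that fix might slot into the loop above (the exists()/get() pair is a tidier stand-in for the eval(parse()) construction; the nrow(sym) > 1 guard is the fix the self-answer describes):
for (tk in tickers) {
  #Skip tickers that never made it into the WoW environment
  if (!exists(tk, envir = WoW, inherits = FALSE)) next
  sym <- get(tk, envir = WoW)
  #The fix: also skip tickers whose download came back (nearly) empty
  if (!is.null(nrow(sym)) && nrow(sym) > 1) {
    colnames(sym) <- c("Open", "High", "Low", "Close", "Volume")
    sym <- data.frame(sym)
    sym <- cbind(BizDay = rownames(sym), sym)
    write.table(sym, file = paste0(pth, tk, ".csv"),
                na = "", sep = ",", row.names = FALSE)
  }
}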

How to download data from web? Need to skip and move to next item if data doesn't exist

I'm trying to modify a script that I found online. I want the script to skip a ticker if no results are found, kind of like Try...Catch. Right now, the code just falls over if a ticker isn't found. How can I get it to skip any missing tickers and finish without crashing?
require(XML)
require(plyr)
getKeyStats_xpath <- function(symbol) {
  yahoo.URL <- "http://finance.yahoo.com/q/ks?s="
  html_text <- htmlParse(paste(yahoo.URL, symbol, sep = ""), encoding = "UTF-8")
  #search for <td> nodes anywhere that have class 'yfnc_tablehead1'
  nodes <- getNodeSet(html_text, "/*//td[@class='yfnc_tablehead1']")
  if (length(nodes) > 0) {
    measures <- sapply(nodes, xmlValue)
    #Clean up the column name
    measures <- gsub(" *[0-9]*:", "", gsub(" \\(.*?\\)[0-9]*:", "", measures))
    #Remove dups
    dups <- which(duplicated(measures))
    #print(dups)
    for (i in 1:length(dups))
      measures[dups[i]] = paste(measures[dups[i]], i, sep = " ")
    #use siblings function to get value
    values <- sapply(nodes, function(x) xmlValue(getSibling(x)))
    df <- data.frame(t(values))
    colnames(df) <- measures
    return(df)
  } else {
    break
  }
}
setwd("C:/Users/rshuell001/Desktop/downloads/")
tickers <- c("OREX",
"OSIR",
"PACB",
"PCRX",
"PCYC",
"PDLI",
"PETX",
"PGNX",
"POZN",
"PRTA",
"PTCT",
"PTLA",
"PTX",
"QGEN",
"QLTI",
"RARE",
"RCPT",
"RDUS",
"REGN",
"RGEN",
"RGLS",
"RLYP",
"RPTP",
"RTRX",
"RVNC",
"SAGE")
stats <- ldply(tickers, getKeyStats_xpath)
rownames(stats) <- tickers
write.csv(t(stats), "FinancialStats_updated.csv",row.names=TRUE)
I posted here about a week ago, and a couple of people suggested I try this:
stats <- llply(tickers,
  function(t) tryCatch(getKeyStats_xpath(t),
    error = function(x) { cat("error occurred, skipping this ticker\n") }))
rownames(stats) <- tickers
write.csv(t(stats), "FinancialStats_updated.csv",row.names=TRUE)
That never worked. I messed around with it for a while but couldn't get it to write the data to a CSV file, although the code looks pretty close to what I want.
Can someone help me get this straightened out?
Thanks!!
Here is the error your code produces:
Error in FUN(X[[i]], ...) :
no loop for break/next, jumping to top level
Let's look at help(break):
break breaks out of a for, while or repeat loop; control is transferred to the first statement outside the inner-most loop
So break may only be used in a for, while or repeat loop, and you are using it in an if-else statement. We can just replace that break in getKeyStats_xpath with something else, and return something to mark that the symbol was not found:
...
} else {
  cat("Could not find", symbol, "\n")
  return(data.frame(NA))
}
...
It now works fine. After you have assigned row names to stats, you can remove the rows that are all NAs with:
stats <- stats[!rowSums(is.na(stats)) == length(stats),]
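Putting it together with the question's own output steps (the ldply() call now succeeds, because every branch of getKeyStats_xpath returns a data frame):
stats <- ldply(tickers, getKeyStats_xpath)
rownames(stats) <- tickers
#Drop the all-NA rows left by tickers that were not found
stats <- stats[!rowSums(is.na(stats)) == length(stats),]
write.csv(t(stats), "FinancialStats_updated.csv", row.names = TRUE)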

implementing tryCatch R

Trying to use tryCatch. What I want is to run through a list of URLs that I have stored in page1URLs and, if there is a problem with one of them (using readHTMLTable()), record which ones failed and then have the code go on to the next URL without crashing.
I think I don't have the right idea here at all. Can anyone suggest how I can do this?
Here is the beginning of the code:
baddy <- rep(NA, 10000)
badURLs <- function(url) { baddy <- c(baddy, url) }

writeURLsToCsvExtrema <- function(maxlat, minlat, maxlong, minlong) {
  urlsFuku <- page1URLs
  allFuku <- data.frame() # need to initialize it with column names
  for (url in urlsFuku) {
    tryCatch(temp.tables <- readHTMLTable(url), finally = badURLs(url))
    temp.df <- temp.tables[[3]]
    lastrow <- nrow(temp.df)
    temp.df <- temp.df[-c(lastrow - 1, lastrow), ]
  }
}

writeURLsToCsvExtrema(38.361042, 35.465144, 141.410522, 139.564819)
One general approach is to write a function that fully processes one URL, returning either the computed value or NULL to indicate failure:
FUN <- function(url) {
  tryCatch({
    xx <- readHTMLTable(url) ## will sometimes fail, invoking 'error' below
    ## more calculations
    xx ## final value
  }, error = function(err) {
    ## what to do on error? could return conditionMessage(err) or other...
    NULL
  })
}
and then use this, e.g., with a named vector:
urls <- c("http://cran.r-project.org", "http://stackoverflow.com",
          "http://foo.bar")
names(urls) <- urls # add names to urls, so 'result' elements are named
result <- lapply(urls, FUN)
These guys failed (returned NULL):
> names(result)[sapply(result, is.null)]
[1] "http://foo.bar"
And these are the results for further processing:
final <- Filter(Negate(is.null), result)
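If you also want the failing URLs recorded on disk (the role baddy was meant to play), a minimal follow-up might be:
#Persist the record of failures; the file name here is just an example
bad <- names(result)[sapply(result, is.null)]
write.csv(data.frame(url = bad), "badURLs.csv", row.names = FALSE)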

Add a column to every object currently loaded in the R environment

I would like to add a column to every data frame in my R environment; they all have the same format.
I can create the column I want with a simple assignment like this:
x[,8] <- x[,4]/(x[,4]+x[,5])
When I try to put this in a for loop that iterates over every object in the environment, I get an error:
control_data <- ls()
for (i in control_data) {(i[,8] <- i[,4]/(i[,4]+i[,5]))}
Error: unexpected '[' in "for (i in control_data) {["
Here is what the input files look like:
ENSMUSG00000030088 Aldh1l1 chr6:90436420-90550197 1.5082200 3.130860 0.671814 0.0000000
ENSMUSG00000020932 Gfap chr11:102748649-102762226 7.0861500 44.182700 20.901700 0.2320750
ENSMUSG00000024411 Aqp4 chr18:15547902-15562193 3.4920400 3.474880 2.463230 0.0331238
ENSMUSG00000023913 Pla2g7 chr17:43705046-43749150 1.5105400 24.275600 11.422400 1.5111100
ENSMUSG00000035805 Mlc1 chr15:88786313-88809437 1.9010200 7.147400 5.313190 0.6358940
ENSMUSG00000007682 Dio2 chr12:91962993-91976878 1.7322900 12.094200 6.738320 1.0736900
ENSMUSG00000017390 Aldoc chr11:78136469-78141283 55.4562000 199.958000 91.328300 22.9541000
ENSMUSG00000005089 Slc1a2 chr2:102498815-102630941 63.7394000 130.729000 103.710000 10.0406000
ENSMUSG00000070880 Gad1 chr2:70391128-70440071 2.6501400 14.907500 13.730200 1.3992200
ENSMUSG00000026787 Gad2 chr2:22477724-22549394 3.9908200 11.308600 28.221500 1.4530500
Thank you for any help you could provide. Is there a better way to do this using an apply function?
As mentioned in the comment, your error happens because ls() returns the objects' names as strings, not the objects themselves.
To use a for-loop directly on those names you'd be headed down the eval(parse(...)) path; a cleaner route is get() with lapply and a helper function:
myfun <- function(x) {
  df <- get(x)
  df[,8] <- df[,4] / (df[,4] + df[,5])
  return(df)
}

#ls() also returns the name of myfun itself, so keep only the data frames
control_data <- Filter(function(n) is.data.frame(get(n)), ls())
results <- lapply(control_data, myfun)
As per the comment, to modify the objects in place instead:
for (i in control_data) {
  df <- get(i)
  df[,8] <- df[,4] / (df[,4] + df[,5])
  assign(i, df)
}
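As for the asker's closing question about an apply-style route: eapply() can sweep a whole environment in one go (a sketch; the is.data.frame guard leaves non-data-frame objects, including functions, untouched):
#Compute over every data frame in the global environment
#and write the modified copies back with list2env()
updated <- eapply(globalenv(), function(obj) {
  if (is.data.frame(obj)) obj[,8] <- obj[,4] / (obj[,4] + obj[,5])
  obj
})
list2env(updated, envir = globalenv())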
