I am writing code in R to read a list of CSV files from 6000 URLs.
data <- read.csv(url)
If R cannot access a URL, the code execution stops. Does anyone know how to avoid this error stop in R?
I have been looking for an argument to the read.csv function that does this, but there is probably a separate function for it.
Simply use tryCatch to catch the error inside the loop but continue on to other iterations:
# DEFINED METHOD, ON ERROR PRINTS MESSAGE AND RETURNS NULL
read_data_from_url <- function(url) {
  tryCatch({
    read.csv(url)
  }, error = function(e) {
    print(e)
    return(NULL)
  })
}
# NAMED LIST OF DATA FRAMES
df_list <- sapply(
list_of_6000_urls, read_data_from_url, simplify = FALSE
)
# FILTER OUT NULLS (PROBLEMATIC URLS)
df_list <- Filter(NROW, df_list)
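To retry the problematic URLs later, it can help to record which ones failed before filtering them out. The following is a minimal sketch of that idea, not part of the original answer: it uses local temporary files in place of URLs so it runs offline, and the names read_or_null, good, missing_file and sources are made up for illustration.

```r
# Hypothetical helper: returns NULL instead of stopping when a read fails
read_or_null <- function(path) {
  tryCatch(read.csv(path), error = function(e) NULL)
}

# Stand-ins for URLs: one readable CSV and one path that does not exist
good <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), good, row.names = FALSE)
missing_file <- tempfile(fileext = ".csv")  # never created, so read.csv() errors

sources <- c(good, missing_file)

# sapply() on a character vector names each result after its input
df_list <- sapply(sources, read_or_null, simplify = FALSE)

# Record the failures before dropping them
failed <- names(df_list)[vapply(df_list, is.null, logical(1))]
df_list <- Filter(Negate(is.null), df_list)
```

Here failed holds the inputs that could not be read, so only those need to be attempted again.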
I have a data frame and am trying to run a Shapiro-Wilk test on multiple columns.
When I try to use the following code:
DF.Shapiro <- do.call(rbind, lapply(DF[c(3:41)], function(x) shapiro.test(x)[c("statistic", "p.value")]))
This message always appears:
"Error in shapiro.test(x) : all 'x' values are identical"
Or
"Error in FUN(X[[i]], ...) : all 'x' values are identical"
How can I solve this?
Without data it's difficult to say but maybe the following untested solution will do what the question asks for.
# which() must index the full data frame so the positions line up with 3:41
num <- which(sapply(DF, is.numeric))
num <- intersect(3:41, num)
do.call(
rbind.data.frame,
lapply(DF[num], function(x){
tryCatch(shapiro.test(x)[c("statistic", "p.value")],
error = function(e) e)
})
)
Edit
If some of the tests return an error, the lapply instruction will return different types of data and the rbind.data.frame method will also give an error. The following code solves that problem by saving the lapply results in a named list, test_list and checking the list members for errors before binding the right ones.
test_list <- lapply(DF[num], function(x){
tryCatch(shapiro.test(x)[c("statistic", "p.value")],
error = function(e) e)
})
err <- sapply(test_list, inherits, "error")
err_list <- test_list[err]
do.call(rbind.data.frame, test_list[!err])
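Since the question came without data, here is a small self-contained sketch of the same pattern on a made-up data frame. The data frame DF_demo and its column names are invented for illustration; column b is constant, which is what triggers the "all 'x' values are identical" error.

```r
# Made-up data: column b is constant, so shapiro.test() fails on it
DF_demo <- data.frame(
  a = c(2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9),
  b = rep(1, 8),
  c = c(10.2, 9.8, 11.5, 10.9, 9.1, 10.4, 12.0, 9.6)
)

# Run every test, keeping the condition object when a test fails
test_list <- lapply(DF_demo, function(x) {
  tryCatch(shapiro.test(x)[c("statistic", "p.value")],
           error = function(e) e)
})

# Separate the failures from the successful tests
err <- sapply(test_list, inherits, "error")
skipped <- vapply(test_list[err], conditionMessage, character(1))
res <- do.call(rbind.data.frame, test_list[!err])
```

The skipped vector keeps the error message for each column that could not be tested, which makes it easy to see why a column is absent from res.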
I am trying to write a function that cleans spreadsheets. However, some of the spreadsheets are corrupted and will not open. I want the function to recognize this, print an error message, skip the rest of the function, and continue with the next file (I am using lapply() to iterate across files). My current attempt looks like this:
candidate.cleaner <- function(filename){
#this function cleans candidate data spreadsheets into an R dataframe
#dependency check
library(readxl)
#read in
cand_df <- tryCatch(read_xls(filename, col_names = F),
error = function (e){
warning(paste(filename, "cannot be opened; corrupted or does not exist"))
})
print(filename)
#rest of function
cand_df[1,1]
}
test_vec <- c("test.xls", "test2.xls", "test3.xls")
lapply(FUN = candidate.cleaner, X = test_vec)
However, this still executes the line of the function after the tryCatch statement when given a .xls file that does not exist, which throws a stop since I'm attempting to index a dataframe that doesn't exist. This exits the lapply call. How can I write the tryCatch call to make it skip execution of the rest of the function without exiting lapply?
One could set a semaphore at the start of the tryCatch() indicating that things have gone OK so far, then handle the error and signal that things have gone wrong, and finally check the semaphore and return from the function with an appropriate value.
lapply(1:5, function(i) {
value <- tryCatch({
OK <- TRUE
if (i == 2)
stop("stopping...")
i
}, error = function(e) {
warning("oops: ", conditionMessage(e))
OK <<- FALSE # assign in parent environment
}, finally = {
## return NA on error
OK || return(NA)
})
## proceed
value * value
})
This allows one to continue using the tryCatch() infrastructure, e.g., to translate warnings into errors. The tryCatch() block encapsulates all the relevant code.
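As a rough illustration of translating warnings into errors, the sketch below promotes a coercion warning to an error in an inner tryCatch() and lets an outer tryCatch() skip the failed element. The function name strict_numeric and the inputs are hypothetical and are not taken from the question.

```r
# Hypothetical helper: treat a coercion warning as a hard failure
strict_numeric <- function(x) {
  tryCatch(
    as.numeric(x),
    warning = function(w) stop("not numeric: ", x, call. = FALSE)
  )
}

inputs <- c("1.5", "abc")

# The outer tryCatch() catches the promoted error and returns NULL instead
res <- lapply(inputs, function(x) {
  tryCatch(strict_numeric(x), error = function(e) NULL)
})
```

The stop() call is made from the warning handler, so the resulting error propagates out of the inner tryCatch() and is handled by the outer one.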
It turns out this can be accomplished in a simple way with try() and an additional helper function.
candidate.cleaner <- function(filename){
#this function cleans candidate data spreadsheets into an R dataframe
#dependency check
library(readxl)
#read in
cand_df <- try(read_xls(filename, col_names = F))
if(is.error(cand_df) == T){
return(list("Corrupted: rescrape", filename))
} else {
#storing election name for later matching
election_name <- cand_df[1,1]
}
}
Where is.error() is taken from Hadley Wickham's Advanced R chapter on debugging. It's defined as:
is.error <- function(x) inherits(x, "try-error")
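The same early-return idea also works with tryCatch() directly: have the error handler return a sentinel such as NULL, then check for it before running the rest of the function. This is only a sketch under assumptions: it reads CSV files with read.csv() instead of read_xls() so that it runs without readxl, and clean_one and the file paths are hypothetical.

```r
# Hypothetical cleaner: returns NULL early when the file cannot be read
clean_one <- function(path) {
  df <- tryCatch(read.csv(path), error = function(e) {
    message(path, " cannot be opened; corrupted or does not exist")
    NULL  # sentinel value handed back to the caller
  })
  if (is.null(df)) return(NULL)  # skip the rest of the function
  # rest of function
  df[1, 1]
}

ok_file <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), ok_file, row.names = FALSE)
bad_file <- tempfile(fileext = ".csv")  # never created

out <- lapply(c(ok_file, bad_file), clean_one)
```

Because the failure is turned into an ordinary return value, lapply() keeps going and the NULL entries mark the files that need attention.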
I'm trying to add some clarity to which file in a long list is throwing an error. I've tried wrapping the for loop in tryCatch(), but I can't get the behavior I'm looking for. The result I'm going for is: if file i throws an error, print sprintf("Error: %s has a formatting problem", i).
Below, we create a directory with three files: two should be read correctly into a list by the for loop, and one will throw an error because it's an xlsx file.
Note that the code below will create and delete a directory and three files.
dir.create("Demo999")
setwd("Demo999")
write.csv(mtcars,"mtcars.csv")
xlsx::write.xlsx(mtcars,"mtcars.xlsx")
write.csv(mtcars,"mtcars2.csv")
files <- list.files()
data <- list()
for (i in files){
f <- read.csv(i)
data[[i]] <- f
}
# clean up generated files
setwd("..")
unlink("Demo999", recursive= TRUE, force= TRUE)
My desired output is:
"Error: mtcars.xlsx has a formatting problem."
This code does not run, but is a sample tryCatch block:
tryCatch({
for (i in files){
f <- read.csv(i)
data[[i]] <- f
}
}, error = function() sprintf("Error: %s has a formatting problem", i))
Put the tryCatch inside the loop, not outside the loop. You don't want to try the whole loop and do something else if the loop fails; you want the loop to try each list element, and print an error if it fails. This way you can go back and re-attempt only the failed files. Try this:
for (i in files){
  tryCatch({
    f <- read.csv(i)
    data[[i]] <- f
  },
  error = function(e) print(sprintf("Error: %s has a formatting problem", i))
  )
}
# [1] "Error: mtcars.xlsx has a formatting problem"
names(data)
# [1] "mtcars.csv" "mtcars2.csv"
Note that, in this nice example, you know what succeeded from the names(data), so you can easily find fails = setdiff(files, names(data)).
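A possible extension, sketched below, stores the actual error message for each failed file alongside the setdiff() bookkeeping. It is an illustration rather than part of the original answer: it works in a temporary directory, uses a zero-byte file (an assumption chosen because read.csv() reliably errors on it) in place of the xlsx file, and the names demo_dir and errors are made up.

```r
# Temporary directory with one valid CSV and one zero-byte file
demo_dir <- tempfile("demo")
dir.create(demo_dir)
write.csv(mtcars, file.path(demo_dir, "mtcars.csv"))
file.create(file.path(demo_dir, "empty.csv"))  # read.csv() errors on an empty file

files <- list.files(demo_dir, full.names = TRUE)
data <- list()
errors <- character()

for (f in files) {
  tryCatch({
    data[[basename(f)]] <- read.csv(f)
  }, error = function(e) {
    # <<- is needed because the handler runs in its own function environment
    errors[basename(f)] <<- conditionMessage(e)
  })
}

# Files that never made it into the list
failed <- setdiff(basename(files), names(data))

# clean up generated files
unlink(demo_dir, recursive = TRUE)
```

The errors vector is named by file, so errors[failed] gives the reason each file was skipped.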
I'm running this function:
require(XML)
require(plyr)
getKeyStats_xpath <- function(symbol) {
yahoo.URL <- "http://finance.yahoo.com/q/ks?s="
html_text <- htmlParse(paste(yahoo.URL, symbol, sep = ""), encoding="UTF-8")
#search for <td> nodes anywhere that have class 'yfnc_tablehead1'
nodes <- getNodeSet(html_text, "/*//td[@class='yfnc_tablehead1']")
if(length(nodes) > 0 ) {
measures <- sapply(nodes, xmlValue)
#Clean up the column name
measures <- gsub(" *[0-9]*:", "", gsub(" \\(.*?\\)[0-9]*:","", measures))
#Remove dups
dups <- which(duplicated(measures))
#print(dups)
for(i in 1:length(dups))
measures[dups[i]] = paste(measures[dups[i]], i, sep=" ")
#use siblings function to get value
values <- sapply(nodes, function(x) xmlValue(getSibling(x)))
df <- data.frame(t(values))
colnames(df) <- measures
return(df)
} else {
break
}
}
As long as the page exists, it works fine. However, if one of my tickers does NOT have any data on that URL, it throws an error:
Error in FUN(X[[3L]], ...) : no loop for break/next, jumping to top level
I added a trace too, and things break down on ticker number 3.
tickers <- c("QLTI",
"RARE",
"RCPT",
"RDUS",
"REGN",
"RGEN",
"RGLS")
tryCatch({
stats <- ldply(tickers, getKeyStats_xpath)
}, finally={})
I'd like to call the function like this:
stats <- ldply(tickers, getKeyStats_xpath)
rownames(stats) <- tickers
write.csv(t(stats), "FinancialStats_updated.csv",row.names=TRUE)
Basically, if a ticker has no data, I want to skip it.
Can someone please help me get this working?
Expanding on my comment: the issue here is that you've enclosed the entire command stats <- ldply(tickers, getKeyStats_xpath) within a single tryCatch. R therefore tries the whole ldply call as one unit, so an error on any one ticker aborts the call for all of them.
Instead, what you want is to try each ticker.
To do this, write a wrapper for getKeyStats_xpath that encloses it in tryCatch. You could do this within ldply with an anonymous function, for example ldply(tickers, function (t) tryCatch(getKeyStats_xpath(t), finally={})). Note that finally executes regardless of exit condition, so finally={} executes nothing. (See Advanced R or How to write try catch in R from r-faq for more).
On an error, tryCatch calls the function provided in the argument error. So as is, this code still won't help, as the error is unhandled (thanks to rawr for pointing this out earlier). It is also easier to inspect the output if you use llply instead.
So a complete answer using this approach, and with informative error handling, is below.
stats <- llply(tickers,
function(t) tryCatch(getKeyStats_xpath(t),
error=function(x) {
cat("error occurred for:\n", t, "\n...skipping this ticker\n")
}
)
)
names(stats) <- tickers
lapply(stats, length)
#<snip>
#$RCPT
#[1] 0
# </snip>
As of now, this works for me, returning data for all tickers except the one listed in the code block above.
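The same per-ticker pattern can be sketched in base R without plyr or a network connection. Everything below is an assumption made for illustration: fake_stats is a stand-in for getKeyStats_xpath that signals "no data" with stop() (a function cannot use break outside a loop), and skip_on_error is a hypothetical wrapper factory.

```r
# Stand-in for getKeyStats_xpath: fails for one ticker via stop()
fake_stats <- function(symbol) {
  if (symbol == "RCPT") stop("no data for ", symbol)
  data.frame(symbol = symbol, n_chars = nchar(symbol))
}

# Hypothetical wrapper factory: returns a version of f that yields NULL on error
skip_on_error <- function(f) {
  function(x) {
    tryCatch(f(x), error = function(e) {
      message("skipping ", x, ": ", conditionMessage(e))
      NULL
    })
  }
}

tickers <- c("QLTI", "RARE", "RCPT", "RDUS")
results <- lapply(tickers, skip_on_error(fake_stats))
names(results) <- tickers

# Drop the skipped tickers and bind the rest into one data frame
stats <- do.call(rbind, Filter(Negate(is.null), results))
```

Filtering out the NULL entries before binding means the row names of stats only contain tickers that actually returned data.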
Trying to use tryCatch. What I want is to run through a list of URLs that I have stored in page1URLs, and if there is a problem with one of them (using readHTMLTable()), I want a record of which ones failed and then I want the code to go on to the next URL without crashing.
I think I don't have the right idea here at all. Can anyone suggest how I can do this?
Here is the beginning of the code:
baddy <- rep(NA,10,000)
badURLs <- function(url) { baddy=c(baddy,url) }
writeURLsToCsvExtrema(38.361042, 35.465144, 141.410522, 139.564819)
writeURLsToCsvExtrema <- function(maxlat, minlat, maxlong, minlong) {
urlsFuku <- page1URLs
allFuku <- data.frame() # need to initialize it with column names
for (url in urlsFuku) {
tryCatch(temp.tables=readHTMLTable(url), finally=badURLs(url))
temp.df <- temp.tables[[3]]
lastrow <- nrow(temp.df)
temp.df <- temp.df[-c(lastrow-1,lastrow),]
}
One general approach is to write a function that fully processes one URL, returning either the computed value or NULL to indicate failure:
FUN = function(url) {
  tryCatch({
    xx <- readHTMLTable(url) ## will sometimes fail, invoking 'error' below
    ## more calculations
    xx ## final value
  }, error=function(err) {
    ## what to do on error? could return conditionMessage(err) or other...
    NULL
  })
}
and then use this, e.g., with a named vector
urls <- c("http://cran.r-project.org", "http://stackoverflow.com",
"http://foo.bar")
names(urls) <- urls # add names to urls, so 'result' elements are named
result <- lapply(urls, FUN)
These guys failed (returned NULL)
> names(result)[sapply(result, is.null)]
[1] "http://foo.bar"
And these are the results for further processing
final <- Filter(Negate(is.null), result)
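One reason the attempt in the question could not record only the bad URLs is that finally runs on every call, whether or not an error occurred, while error runs only on failure. The sketch below shows the difference with a toy function; the names events and run are hypothetical and the inputs are arbitrary numbers rather than URLs.

```r
events <- character()

# Toy function: fails for negative input, logs what each handler does
run <- function(x) {
  tryCatch({
    if (x < 0) stop("negative input")
    sqrt(x)
  }, error = function(e) {
    events <<- c(events, paste("error:", x))    # only on failure
    NA_real_
  }, finally = {
    events <<- c(events, paste("finally:", x))  # on success and on failure
  })
}

out <- vapply(c(4, -1), run, numeric(1))
```

After this runs, events contains a finally entry for both inputs but an error entry only for -1, so bookkeeping for bad URLs belongs in the error handler.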