I have a loop that reads HTML table data from ~440 web pages. The code on each page is not exactly the same, so sometimes I need table node 1 and sometimes node 2. Right now I've just been setting the node number manually in a list and feeding it into the loop. My problem is that the page nodes have started changing, and updating the node-number list is getting to be a hassle.
If the loop encounters the wrong node number (i.e. 1 instead of 2, or the reverse), it throws an error and stops. Is there a way to have the loop replace the erroneous node number with the correct one when it encounters an error, and then keep running as if nothing happened?
Here's the readHTML portion of the code in my loop with an example url:
url <- "http://espn.go.com/nba/player/gamelog/_/id/2991280/year/2013/"
html.page <- htmlParse(url)
tableNodes <- getNodeSet(html.page, "//table")
x <- as.numeric(Players$Nodes[s])
tbl = readHTMLTable(tableNodes[[x]], colClasses = c("character"),stringsAsFactors = FALSE)
Here's the error I get when the node # is wrong:
"Error in readHTMLTable(tableNodes[[x]], colClasses = c("character"), stringsAsFactors = FALSE) : error in evaluating the argument 'doc' in selecting a method for function 'readHTMLTable': Error in tableNodes[[x]] : subscript out of bounds"
Example code:
A <- c("dog", "cat")
Nodes <- as.data.frame(1:1)
#Nodes <- as.data.frame(1:2)   <-- This works without errors
colnames(Nodes)[1] <- "Col1"
Nodes2 <- 2
url <-c("http://espn.go.com/nba/player/gamelog/_/id/6639/year/2013/","http://espn.go.com/nba/player/gamelog/_/id/6630/year/2013/")
for (i in 1:length(A))
{
html.page <- htmlParse(url[i])
tableNodes <- getNodeSet(html.page, "//table")
x <- as.numeric(Nodes$Col1[i])
df = readHTMLTable(tableNodes[[x]], colClasses = c("character"),stringsAsFactors = FALSE)
#tryCatch(df) here.....no clue
assign(paste0("", A[i]), df)
}
If you get a "subscript out of bounds" error message, then you should certainly try a lower x. Here's a general demo with tryCatch based on the demo code you posted in the original question (although I have replaced x with 2, as I have no idea what Players and s are):
> msg <- tryCatch(readHTMLTable(tableNodes[[2]], colClasses = c("character"),stringsAsFactors = FALSE), error = function(e)e)
> str(msg)
List of 2
$ message: chr "error in evaluating the argument 'doc' in selecting a method for function 'readHTMLTable': Error in tableNodes[[2]] : subscript"| __truncated__
$ call : language readHTMLTable(tableNodes[[2]], colClasses = c("character"), stringsAsFactors = FALSE)
- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
> msg$message
[1] "error in evaluating the argument 'doc' in selecting a method for function 'readHTMLTable': Error in tableNodes[[2]] : subscript out of bounds\n"
> grepl('subscript out of bounds', msg$message)
[1] TRUE
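Putting that together with the loop from the question, a minimal sketch could try the preferred node first and fall back to the other candidate whenever the error appears. The helper name read_player_table and the assumption that the table you want is always at node 1 or node 2 are mine:
library(XML)

read_player_table <- function(url, preferred_node) {
  html.page  <- htmlParse(url)
  tableNodes <- getNodeSet(html.page, "//table")
  # Try the preferred node first, then fall back to the other candidate
  for (x in c(preferred_node, setdiff(1:2, preferred_node))) {
    tbl <- tryCatch(
      readHTMLTable(tableNodes[[x]], colClasses = c("character"),
                    stringsAsFactors = FALSE),
      error = function(e) NULL   # wrong node -> try the next candidate
    )
    if (!is.null(tbl)) return(tbl)
  }
  NULL  # neither candidate produced a table
}

# Example with the URL from the question
tbl <- read_player_table(
  "http://espn.go.com/nba/player/gamelog/_/id/2991280/year/2013/", 2)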
I have a function that does some basic web harvesting. This function is called after a successful login. (The website has been masked as xxxxxx.)
Search Function:
search <-function(HorseList){
url <- "http://tnetwork.xxxxxx.com/tnet/HorseSearch.aspx"
s <- GET(url)
xxxxxx <- tibble(
horse_name = character(),
race_date = character(),
race_nbr = character(),
trk = character(),
peak = character(),
dist_run = character()
)
for (row in 1:nrow(HorseList))
{
url <-paste(c('http://tnetwork.xxxxxx.com/tnet/HorseSearchAPI.aspx?HorseName=',toString(HorseList[[row, 1]])),collapse='')
#print(url)
h <- GET(url)
temp<-content(h, "text")
doc <- htmlParse(temp)
horse_name <- HorseList[[row,1]]
horse_ID <- xpathSApply(doc, "//*[@id=\"resultsDiv\"]/p[1]/a/@href")
horse_ID <-substr(horse_ID,27,40)
h_list <- list()
c <- nchar(horse_ID)
if (length(c)>0)
{
h_list[1] <- horse_ID
}
id_count <- length(h_list)
for (k in 1:id_count)
{
url <-paste(c('http://tnetwork.xxxxxx.com/tnet/t_PastPerf.aspx?HorseID=',toString(h_list[k])),collapse='')
t <- GET(url)
temp <- content(t, "text")
pastperf <- htmlParse(temp)
row_count <- length(xpathSApply(pastperf, "//*[@id=\"pastPerfTable\"]/tr"))
for(j in 2:row_count)
{
j<- toString(j)
race_data <- xpathSApply(pastperf, paste("//*[@id=\"pastPerfTable\"]/tr[", j, "]/td[1][1]"), xmlValue)
race_date <- substr(race_data,1,10)
race_number <-trimws(substr(race_data,12,100))
horse_name <- URLdecode(toString(horse_name))
race_nbr = str_match(race_number,'(Race\\s\\d+)(.*)')[,2]
trk = str_match(race_number,'(Race\\s\\d+)(.*)')[,3]
peak <- xpathSApply(pastperf, paste("//*[@id=\"pastPerfTable\"]/tr[", j, "]/td[13]"), xmlValue)
cum_distance <- xpathSApply(pastperf, paste("//*[@id=\"pastPerfTable\"]/tr[", j, "]/td[14]"), xmlValue)
newrow <- paste(horse_name,',',race_date,',',race_nbr,',',trk, ',',peak,',',cum_distance)
xxxxxx <- add_row(xxxxxx, horse_name = horse_name, race_date = race_date, race_nbr = race_nbr, trk = trk, peak = peak, dist_run = cum_distance)
}
}
}
return(xxxxxx)
}
The function has worked successfully in the past, but today it is throwing the following error:
Error: Internal error in `vec_assign()`: `value` should have been recycled to fit `x`.
I ran the rlang::last_error() and last_trace() commands to gain some additional insight, but I'm still not sure what's going on.
> rlang::last_error()
<error/rlang_error>
Internal error in `vec_assign()`: `value` should have been recycled to fit `x`.
Backtrace:
1. base::source("~/TimeForm/Scripts/past_perf.R", echo = TRUE)
6. global::search(horse_list) ~/TimeForm/Scripts/past_perf.R:627:2
7. tibble::add_row(...) ~/TimeForm/Scripts/past_perf.R:85:8
8. tibble:::rbind_at(.data, df, pos)
9. vctrs::vec_rbind(old, new)
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/rlang_error>
Internal error in `vec_assign()`: `value` should have been recycled to fit `x`.
Backtrace:
x
1. +-base::source("~/TimeForm/Scripts/past_perf.R", echo = TRUE)
2. +-base::source("~/TimeForm/Scripts/past_perf.R", echo = TRUE)
3. | +-base::withVisible(eval(ei, envir))
4. | \-base::eval(ei, envir)
5. | \-base::eval(ei, envir)
6. \-global::search(horse_list) ~/TimeForm/Scripts/past_perf.R:627:2
7. \-tibble::add_row(...) ~/TimeForm/Scripts/past_perf.R:85:8
8. \-tibble:::rbind_at(.data, df, pos)
9. \-vctrs::vec_rbind(old, new)
10. \-(function () ...
It appears the add_row() line in my code may be the culprit, but I'm not sure what the error is telling me or how to fix it. Does anyone have any insights they could share?
I found that the problem occurs in the handling of empty character fields, NULL values, or NA. The problem was corrected with a map_if() replacement of the empty/NULL values:
map_depth(.depth = 1, map_if, is_empty, ~paste0(""))
Of course, you'll need to adjust the .depth argument to make the correction at the appropriate level of your list construct.
Ideally, bind_rows() would handle vectors containing NA or NULL values more robustly.
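For concreteness, here is a minimal sketch of that cleanup applied to a list of rows before they reach add_row(); the rows object and its field values are invented for illustration:
library(purrr)
library(rlang)   # for is_empty()

# Invented example: scraped rows where some fields came back empty or NULL
rows <- list(
  list(horse_name = "Horse A", peak = "12.3",       dist_run = "6440"),
  list(horse_name = "Horse B", peak = character(0), dist_run = NULL)
)

# At depth 1 (each row), replace empty/NULL fields with "" so every value
# passed to add_row() has length 1
rows <- map_depth(rows, .depth = 1, map_if, is_empty, ~ paste0(""))

rows[[2]]$peak       # ""
rows[[2]]$dist_run   # ""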
I'll explain the end goal first, and then what I'm trying as a test (because I'm likely going about it the wrong way).
I am using the phyloseq package to visualize microbiome data. I want to "automate" it to an extent by having users choose levels of analysis and having my script generate the visualizations without someone hand-typing each combination.
The issue is passing variables into the subset function. I primarily get these errors (depending on which combination of paste0, eval, parse, as.logical, expression, noquote, etc. I've tried):
Error in subset.data.frame(oldDF, ...) : 'subset' must be logical
Error in dimnames(x) <- dn :
length of 'dimnames' [1] not equal to array extent
A user would set the levels of analysis. So let's say for now there are two levels, and selecting the second level automatically means you want the first level as well. (I haven't worked on that part yet, but I wanted to explain it up front.)
#Set lineage level
lin_level <- 1
lin_list <- c("k__Kingdom", "p__Phylum","c__Class", "o__Order","f__Family")
lin_select <- lin_list[lin_level]
sub_lin <- lin_list[(lin_level +1)]
#Kingdom
king_list <- "k__Bacteria"
#set Phylum list
if (lin_select == "p__Phylum"){
phylum_list <- c("p__Firmicutes","p__Proteobacteria","p__Bacteroidetes","p__Actinobacteria","p__Tenericutes")
}
subgroup <- "All"
From here, the script would ultimately get to the graphing section. If lin_level is set to 1, it would look like this:
FIXED
gphic = subset_taxa(physeq1, Kingdom=="k__Bacteria")
title = paste0(subgroup," ", "Bacteria-only")
plot_bar(gpsfb, "Phylum", "Abundance", "Phylum",
title=title, facet_grid="Type~.")
AUTOMATED
gphic = subset_taxa(physeq1, (substring(lin_select,4)) == king_list)
title = paste0(subgroup," ", (substring(king_list,4)),"-only")
plot_bar(gpsfb, (substring(sub_lin,4)), "Abundance", (substring(sub_lin,4)),
title=title, facet_grid="Type~.")
But trying to pass (substring(lin_select,4)) == king_list as an argument results in errors.
I've searched through the various threads on this issue, but haven't been able to get the different answers to work. Ultimately I need to run the graphing section once for Kingdom, and then again for each item in the Phylum list. But before I can get there, I need to be able to pass the arguments into the subset function.
Things I've tried:
test <- paste0(substring(lin_select,4),"==","\"","p__Bacteroidetes","\"")
noquote(test)
[1] Phylum=="p__Bacteroidetes"
gphic = subset_taxa(physeq1, noquote(test))
Error in subset.data.frame(oldDF, ...) : 'subset' must be logical
gphic = subset_taxa(physeq1, paste0(substring(lin_select,4),"==","\"","p__Bacteroidetes","\""))
Error in subset.data.frame(oldDF, ...) : 'subset' must be logical
gphic = subset_taxa(physeq1, as.logical(test))
Error in dimnames(x) <- dn :
length of 'dimnames' [1] not equal to array extent
as.logical(noquote(test))
[1] NA
gphic = subset_taxa(physeq1, as.logical(noquote(test)))
Error in dimnames(x) <- dn :
length of 'dimnames' [1] not equal to array extent
noquote(test)
[1] Phylum=="p__Bacteroidetes"
as.logical(noquote(test))
[1] NA
as.logical(as.character(noquote(test)))
[1] NA
test2 <- eval(parse(text= test))
Error in eval(parse(text = test)) : object 'Phylum' not found
test2 <- eval(test)
gphic = subset_taxa(physeq1, as.logical(test2))
Error in dimnames(x) <- dn :
length of 'dimnames' [1] not equal to array extent
as.logical(test2)
[1] NA
And a lot of other permutations trying to sub in different things, but you get the idea.
gphic = subset_taxa(physeq1, eval(as.name(level_tax)) == king_list)
Here, level_tax is the variable in a loop. Say level_tax = "Order"; then we convert the string "Order" into a variable name with as.name(level_tax) or as.symbol(level_tax), and use eval(), which takes an expression and evaluates it in the specified environment.
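To see the mechanism outside of phyloseq, here is a minimal sketch on an invented taxonomy data frame; subset_taxa() evaluates the expression against its own taxonomy table (the oldDF in your error messages), which is why eval(as.name(level_tax)) == king_list works inside it:
tax <- data.frame(
  Kingdom = c("k__Bacteria", "k__Bacteria", "k__Archaea"),
  Phylum  = c("p__Firmicutes", "p__Bacteroidetes", "p__Euryarchaeota"),
  stringsAsFactors = FALSE
)

level_tax <- "Phylum"            # column name chosen at run time
col_sym   <- as.name(level_tax)  # the string becomes the symbol `Phylum`

eval(col_sym, tax)               # evaluated against the data frame -> the column
# [1] "p__Firmicutes" "p__Bacteroidetes" "p__Euryarchaeota"

tax[eval(col_sym, tax) == "p__Firmicutes", ]   # row filter built from the variable
#       Kingdom        Phylum
# 1 k__Bacteria p__Firmicutes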
Lately, when I run my code that uses coxph() from the survival package, coxph(frml, data = data), I have started getting warning messages of the following type:
1: In model.matrix.default(Terms, mf, contrasts = contrast.arg) :
partial argument match of 'contrasts' to 'contrasts.arg'
2: In seq.default(along = temp) :
partial argument match of 'along' to 'along.with'
I'm not exactly sure why these partial argument match warnings suddenly started popping up, but I don't think they affect me.
However, when I get the following warning messages, I want coxph(frml, data = data) to return NA:
3: In fitter(X, Y, strats, offset, init, control, weights = weights, :
Loglik converged before variable 2 ; beta may be infinite.
6: In coxph(frml, data = data) :
X matrix deemed to be singular; variable 1 3 4
I used tryCatch when I wasn't getting the partial argument match warnings, using the code below, where if the nested tryCatch got either a warning or an error it would return NA:
coxphfit = tryCatch(tryCatch(coxph(frml,data = data), error=function(w) return(NA)), warning=function(w) return(NA))
However, now that I am getting the partial argument match warnings, I need to return NA only if there is an error or if I get warning messages 3 and 6 above. Any idea how to capture these particular warning messages and return NA in those instances?
It's actually an interesting question. If you are looking for a quick and dirty way of capturing warnings, you could simply do:
withCallingHandlers({
warning("hello")
1 + 2
}, warning = function(w) {
w ->> w
}) -> res
In this example, the object w created in the parent environment would be:
>> w
<simpleWarning in withCallingHandlers({ warning("hello") 1 + 2}, warning = function(w) { w <<- w}): hello>
You could then interrogate it:
grepl(x = w$message, pattern = "hello")
# [1] TRUE
since
>> w$message
# [1] "hello"
Object res would contain your desired results:
>> res
[1] 3
It's not the tidiest way, but I reckon you can always reference the object w and check whether the warning message contains the phrase you are interested in.
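Applied to the coxph() case from the question, a sketch along these lines (untested; frml and data are assumed to be the formula and data set from your code, and the message patterns are taken from the warnings you quoted) would promote only the problematic warnings to errors and muffle the rest:
library(survival)

bad_warning <- function(w) {
  grepl("Loglik converged before variable|X matrix deemed to be singular",
        conditionMessage(w))
}

safe_coxph <- function(frml, data) {
  tryCatch(
    withCallingHandlers(
      coxph(frml, data = data),
      warning = function(w) {
        if (bad_warning(w)) stop(conditionMessage(w), call. = FALSE)  # treat as error
        invokeRestart("muffleWarning")  # ignore e.g. the partial-match warnings
      }
    ),
    error = function(e) NA  # any error, including a promoted warning, returns NA
  )
}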
My objective here is to capture the error that R throws and store it in an object.
Here is some dummy code:
for(i in 1:length(a)){try(
if (i==4)(print(a[i]/"b"))else(print(a[i]/b[i]))
)}
[1] -0.125
[1] -0.2857143
[1] -0.5
Error in a[i]/"b" : non-numeric argument to binary operator
[1] -1.25
[1] -2
[1] -3.5
[1] -8
[1] Inf
[1] 10
So I want to capture that on the 4th iteration the error was Error in a[i]/"b" : non-numeric argument to binary operator, and store it in an object, say:
error<-()
iferror(error[i]<-geterrmessage())
I am aware that iferror is not available as a function in R, but I am trying to convey the idea, because geterrmessage() captures only the last error it sees.
So for this example I want error[1:3] <- NA and error[5:10] <- NA, because there was no error, but
error[4]<-"Error in a[i]/"b" : non-numeric argument to binary operator"
so that later I can check the error object and understand where and what errors happened.
If you can help me write such code, that would be excellent and highly appreciated.
I hope the following function helps:
a <- c(0:6)
b <- c(-3:3)
create_log <- function(logfile_name, save_path) {
warning("Error messages not visible. Use closeAllConnections() in the end of the script")
if (file.exists(paste0(save_path, logfile_name))) {
file.remove(paste0(save_path, logfile_name))
}
fid <- file(paste0(save_path, logfile_name), open = "wt")
sink(fid, type = "message", split = F) # warnings are NOT displayed. split=T not possible.
sink(fid, append = T, type = "output", split = T) # print, cat
return(NULL)
}
create_log("test.csv", "C:/Test/")
for(i in 1:length(a)){try(
if (i==4)(print(a[i]/"b"))else(print(a[i]/b[i]))
)}
closeAllConnections()
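If you want the messages in an object rather than in a log file, a small alternative sketch with tryCatch() can fill a character vector as the loop runs (reusing the a and b defined above):
errors <- rep(NA_character_, length(a))   # one slot per iteration

for (i in seq_along(a)) {
  tryCatch(
    if (i == 4) print(a[i]/"b") else print(a[i]/b[i]),
    error = function(e) errors[i] <<- conditionMessage(e)   # record the message
  )
}

errors[4]
# [1] "non-numeric argument to binary operator"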
I am currently facing the error below, which relates to NULL values being coerced to a data frame. The data set does contain nulls; however, I have tried both is.na() and is.null() to replace the null values with something else. The data is stored on HDFS in pig.hive format. I have also attached the code below. The code works fine if I remove v[,25] from the key.
Code:
AM = c("AN");
UK = c("PP");
sample.map <- function(k,v){
key <- data.frame(acc = v[!is.na(v[,1]), 1],
                  year = substr(v[!is.na(v[,1]), 2], 1, 4),
                  month = substr(v[!is.na(v[,1]), 2], 5, 6))
value <- data.frame(v[,3],count=1)
keyval(key,value)
}
sample.reduce <- function(key,v){
AT <- sum(v[which(v[,1] %in% AM=="TRUE"),2])
UnknownT <- sum(v[which(v[,1] %in% UK=="TRUE"),2])
Total <- AT + UnknownT
d <- data.frame(AT,UnknownT,Total)
keyval(key,d)
}
out <- mapreduce(input = "/user/hduser/input",
                 output = "/user/hduser/output",
                 input.format = make.input.format("pig.hive", sep = "\u0001"),
                 output.format = make.output.format("csv", sep = ","),
                 map = sample.map,
                 reduce = sample.reduce)
Error:
Warning in asMethod(object) : NAs introduced by coercion
Warning in split.default(1:rmr.length(y), unique(ind), drop = TRUE) : data length is not a multiple of split variable
Warning in rmr.split(x, x, FALSE, keep.rownames = FALSE) : number of items to replace is not a multiple of replacement length
Warning in split.default(1:rmr.length(y), unique(ind), drop = TRUE) :
data length is not a multiple of split variable
Warning in rmr.split(v, ind, lossy = lossy, keep.rownames = TRUE) : number of items to replace is not a multiple of replacement length
Error in as(x, class(k)) :
no method or default for coercing “NULL” to “data.frame”
Calls: <Anonymous> ... apply.reduce -> c.keyval -> reduce.keyval -> lapply -> FUN -> as No traceback available
UPDATE
I have added the sample data and edited the code above. Hope this helps!
Sample Data:
NULL,"2014-03-14","PP"
345689202,"2014-03-14","AN"
234539390,"2014-03-14","PP"
123125444,"2014-03-14","AN"
NULL,"2014-03-14","AN"
901828393,"2014-03-14","AN"
There are some issues with as() that have been identified recently. I don't see why as() can't handle this by default, but you can redefine coerce(), the S4 method that handles the conversion, so that it calls as.data.frame():
setMethod("coerce",c("NULL","data.frame"), function(from, to, strict=TRUE) as.data.frame(from))
[1] "coerce"
as(NULL,"data.frame")
data frame with 0 columns and 0 rows