Errors when parallelizing for loop - r

I'm new to parallel processing and am attempting to parallelize a for loop in which I create new columns in a data frame by matching a column in that data frame against two other data frames. j, the data frame I'm creating columns in, is 400000 x 54; a and c, the two data frames I'm matching j against, are 5000 x 12 and 45000 x 8, respectively.
Below is my initial loop prior to the attempt at parallelizing:
for (i in 1:nrow(j)) {
  if (j$Inspection_Completed[i] == TRUE) {
    next
  }
  j$Assigned_ID <- a$Driver[match(j$car_name, a$CarName)]
  j$Title <- c$Title[match(j$Site_ID, c$LocationID)]
  j$Status <- c$Status[match(j$Site_ID, c$LocationID)]
}
So far I have attempted the following:
cl <- snow::makeCluster(4)
doSNOW::registerDoSNOW(cl)
foreach::foreach(i = 1:nrow(j)) foreach::`%dopar%` {
  if (j$Inspection_Completed[i] == TRUE) {
    next
  }
  j$Assigned_ID <- a$Driver[match(j$car_name, a$CarName)]
  j$Title <- c$Title[match(j$Site_ID, c$LocationID)]
  j$Status <- c$Status[match(j$Site_ID, c$LocationID)]
}
snow::stopCluster(cl)
However, when I run the code above I receive several errors.
Error: unexpected symbol in "foreach::foreach(i = 1:nrow(j)) foreach"
And:
Error: object 'i' not found
Lastly:
Error: unexpected '}' in "}"
I'm not sure why I'm getting these errors. None of the columns in any of the data frames are factors, and I haven't been able to spot any mismatched parentheses or brackets. I've also tried this without the snow and doSNOW packages, and the result is the same. I've also run it without the backticks around %dopar%, with the same result.

(I didn't know this before.)
R doesn't like infix operators with the ::-notation. Even if you're doing that for namespace management, R isn't having it:
1L %in% 1:2
# [1] TRUE
1L base::%in% 1:2
# Error: unexpected symbol in "1L base"
1L base::`%in%` 1:2
# Error: unexpected symbol in "1L base"
Workarounds:
Redefine your own infix that just mimics the other, as in
`%myin%` <- base::`%in%`
1L %myin% 1:2
# [1] TRUE
Use explicit namespace loading, with library(foreach) before that point in your code, and just write %dopar%. (Not that it matters much, but using library(foreach) does not stop you from also writing foreach::foreach; it just becomes unnecessary.)
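For concreteness, here is a minimal sketch of that second workaround, keeping the snow/doSNOW setup from the question (the loop body is illustrative only):
library(foreach)   # makes %dopar% visible without ::-notation
library(doSNOW)

cl <- snow::makeCluster(4)
registerDoSNOW(cl)

# foreach is an expression that returns a value built from each iteration
res <- foreach(i = 1:4, .combine = rbind) %dopar% {
  data.frame(i = i, sq = i^2)
}

snow::stopCluster(cl)
Also worth noting: unlike a for loop, foreach collects the value of each iteration. Side effects such as assigning into j, and control flow such as next, happen on worker copies and do not carry back to the master.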


Loop Changing to Matrix then Running tests

I have a dataframe with ~9000 rows of human-coded data in it, two coders per item, so about 4500 unique pairs. I want to break the dataset into each of these pairs (so ~4500 dataframes), run kripp.alpha() on the scores that were assigned, and then save those into a coder sheet I have made. I cannot get the loop to work to do this.
I can get it to work individually, using this:
example.m <- as.matrix(example.m)
s <- kripp.alpha(example.m)
example$alpha <- s$value
However, when trying a loop I am getting either "Error in get(v) : object 'NA' not found" when running this:
for (i in items) {
  v <- i
  v <- v[c("V1","V2")]
  v <- assign(v, as.matrix(get(v)))
  s <- kripp.alpha(v)
  i$alpha <- s$value
}
Or am getting "In i$alpha <- s$value : Coercing LHS to a list" when running:
for (i in items) {
  i.m <- i[c("V1","V2")]
  i.m <- as.matrix(i.m)
  s <- kripp.alpha(i.m)
  i$alpha <- s$value
}
Here is an example set of data. Items is a list of individual dataframes.
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1),nrow=2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1),nrow=2))
items <- c("l","t")
I am sure this is a basic question, but what I want is for each file, i, to add a column with the alpha score at the end. Thanks!
Your problem is with scoping: you're extracting names from objects that are referenced through strings. You'd need to eval() some of your objects to make your current approach work.
Here's another solution:
library("irr") # For kripp.alpha
# Produce the data
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1),nrow=2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1),nrow=2))
# Collect the data as a list right away
items <- list(l, t)
Now you can sapply() directly over the elements in the list.
sapply(items, function(v) {
  kripp.alpha(as.matrix(v[c("V1","V2")]))$value
})
which produces
[1] 0.0 -0.5
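If, as the question asks, you also want the alpha attached to each data frame as a new column, a small lapply() variant of the same idea works (a sketch using the data above):
# attach the alpha score as a new column on each data frame in the list
items <- lapply(items, function(v) {
  v$alpha <- kripp.alpha(as.matrix(v[c("V1","V2")]))$value
  v
})
items[[1]]  # 'l' with an alpha column appended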

Calculating distance using latitude and longitude error [duplicate]

When working with R I frequently get the error message "subscript out of bounds". For example:
# Load necessary libraries and data
library(igraph)
library(NetData)
data(kracknets, package = "NetData")
# Reduce dataset to nonzero edges
krack_full_nonzero_edges <- subset(krack_full_data_frame, (advice_tie > 0 | friendship_tie > 0 | reports_to_tie > 0))
# convert to graph data frame
krack_full <- graph.data.frame(krack_full_nonzero_edges)
# Set vertex attributes
for (i in V(krack_full)) {
  for (j in names(attributes)) {
    krack_full <- set.vertex.attribute(krack_full, j, index = i, attributes[i + 1, j])
  }
}
# Calculate reachability for each vertex
reachability <- function(g, m) {
  reach_mat = matrix(nrow = vcount(g), ncol = vcount(g))
  for (i in 1:vcount(g)) {
    reach_mat[i,] = 0
    this_node_reach <- subcomponent(g, (i - 1), mode = m)
    for (j in 1:(length(this_node_reach))) {
      alter = this_node_reach[j] + 1
      reach_mat[i, alter] = 1
    }
  }
  return(reach_mat)
}
reach_full_in <- reachability(krack_full, 'in')
reach_full_in
This generates the following error: Error in reach_mat[i, alter] = 1 : subscript out of bounds.
However, my question is not about this particular piece of code (even though it would be helpful to solve that too), but my question is more general:
What is the definition of a subscript-out-of-bounds error? What causes it?
Are there any generic ways of approaching this kind of error?
This happens because you are trying to access an array outside its bounds.
I will show you how you can debug such errors.
I set options(error=recover)
I run reach_full_in <- reachability(krack_full, 'in')
I get:
reach_full_in <- reachability(krack_full, 'in')
Error in reach_mat[i, alter] = 1 : subscript out of bounds
Enter a frame number, or 0 to exit
1: reachability(krack_full, "in")
I enter 1 and I get
Called from: top level
I type ls() to see my current variables:
[1] "*tmp*"           "alter"           "g"
[4] "i"               "j"               "m"
[7] "reach_mat"       "this_node_reach"
Now I will check the dimensions of my variables:
Browse[1]> i
[1] 1
Browse[1]> j
[1] 21
Browse[1]> alter
[1] 22
Browse[1]> dim(reach_mat)
[1] 21 21
You can see that alter is out of bounds: 22 > 21, at the line:
reach_mat[i, alter] = 1
To avoid this kind of error, personally I do the following:
Try to use apply-family functions; they are safer than for.
Use seq_along(x) rather than 1:n, which avoids the 1:0 trap.
Try to think of a vectorized solution, if you can, to avoid mat[i, j] index access.
EDIT: vectorize the solution
For example, here I see that you don't use the fact that set.vertex.attribute is vectorized.
You can replace:
# Set vertex attributes
for (i in V(krack_full)) {
  for (j in names(attributes)) {
    krack_full <- set.vertex.attribute(krack_full, j, index = i, attributes[i + 1, j])
  }
}
by this:
## set.vertex.attribute is vectorized!
## no need to loop over the vertices
for (attr in names(attributes))
  krack_full <- set.vertex.attribute(krack_full, attr, value = attributes[, attr])
It just means that either alter > ncol(reach_mat) or i > nrow(reach_mat); in other words, your indices exceed the array boundaries (i is greater than the number of rows, or alter is greater than the number of columns).
Just run the above tests to see what and when is happening.
Only an addition to the above responses: a possibility in such cases is that you are calling an object that, for some reason, is not available to your query. For example, you may subset by row names or column names, and you will receive this error message when the requested row or column is no longer part of the data matrix or data frame.
Solution: as a short version of the responses above, find the last row or column name that still works; the next object called should be the one that could not be found.
If you run parallel code such as foreach(), convert it to a plain for loop first so you can troubleshoot it.
If this helps anybody, I encountered this while using purrr::map() with a function I wrote, which was something like this:
find_nearby_shops <- function(base_account) {
  states_table %>%
    filter(state == base_account$state) %>%
    left_join(target_locations, by = c('border_states' = 'state')) %>%
    mutate(x_latitude = base_account$latitude,
           x_longitude = base_account$longitude) %>%
    mutate(dist_miles = geosphere::distHaversine(p1 = cbind(longitude, latitude),
                                                 p2 = cbind(x_longitude, x_latitude)) / 1609.344)
}
nearby_shop_numbers <- base_locations %>%
  split(f = base_locations$id) %>%
  purrr::map_df(find_nearby_shops)
I would get this error sometimes with samples, but most times I wouldn't. The root of the problem was that some of the states in the base_locations table (PR) did not exist in states_table, so essentially I had filtered out everything and passed an empty table on to mutate(). The moral of the story is that you may have a data issue and not (just) a code problem (so you may need to clean your data).
Thanks for agstudy and zx8754's answers above for helping with the debug.
I sometimes encounter the same issue. I can only answer your second bullet, because I am not as expert in R as I am with other languages. I have found that the standard for loop has some unexpected results. Say x = 0
for (i in 1:x) {
  print(i)
}
The output is
[1] 1
[1] 0
Whereas with python, for example
for i in range(x):
    print i
does nothing. The loop is not entered.
I expected that if x = 0, the loop in R would not be entered. However, 1:0 is a valid sequence (it counts down from 1 to 0). I have not yet found a good workaround besides wrapping the for loop in an if statement.
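For comparison, the seq_len()/seq_along() idiom recommended in the answer above does skip the loop entirely:
x <- 0
for (i in seq_len(x)) {  # seq_len(0) is integer(0), so the body never runs
  print(i)
}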
This came from Stanford's free SNA tutorial, which states that:
# Reachability can only be computed on one vertex at a time. To
# get graph-wide statistics, change the value of "vertex"
# manually or write a for loop. (Remember that, unlike R objects,
# igraph objects are numbered from 0.)
OK, so whenever you use igraph, the first row/column is numbered 0 rather than 1, while an R matrix starts at 1; thus for any calculation under igraph you need x - 1, as shown in:
this_node_reach <- subcomponent(g, (i - 1), mode = m)
but for the alter calculation, there is a typo here:
alter = this_node_reach[j] + 1
Delete the + 1 and it will work all right.
What did it for me was going back through the code, checking for errors or uncertain changes, and focusing on need-to-have over nice-to-have.

Running 'xlsx' processes in parallel, using the 'parallel' R package

I have a project where I need to process some data from an Excel file with R. I must use the 'xlsx' package because of some specific functions.
First, I wrote a script, which works as expected without errors.
options(java.parameters = "-Xmx4096m") #for extra memory
library(xlsx)
wb <- loadWorkbook(file = "my_excel.xlsx")
sheet1 <- getSheets(wb)[[1]]
rows <- getRows(sheet1)
make_df <- function(x) {
  cells <- getCells(rows[x])
  styles <- sapply(cells, getCellStyle)
  cellColor <- function(style) {
    fg <- style$getFillForegroundXSSFColor()
    rgb <- tryCatch(fg$getRgb(), error = function(e) NULL)
    rgb <- paste(rgb, collapse = "")
    return(rgb)
  }
  colors <- sapply(styles, cellColor)
  if (!any(colors == "ff0000")) {
    df[nrow(df) + 1, ] <- sapply(cells, getCellValue) # I define this 'df' somewhere in the code; this part could be improved
  }
}
df <- sapply(1:length(rows), make_df)
In short, I am looking for the rows in Excel where there are no red-colored cells, as described here. The problem is that the Excel file is very big, and it takes a lot of time to process.
What I'd like to do is to run the row checking in parallel, to be more efficient, so I added:
cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl = cl, c(library(xlsx))) # loading the package on the workers
clusterExport(cl = cl, c('rows')) # sharing the 'rows' variable with the workers
df <- parSapply(cl, 1:length(rows), make_df)
And after running this, I get the following error:
Error in checkForRemoteErrors(val) :
7 nodes produced errors; first error: RcallMethod: attempt to call a method of a NULL object.
I tried the parallelization on another example, without any 'xlsx' functions, and it worked.
After some digging, I found this post, which offered somewhat of an answer (more of a workaround), but I can't seem to implement it.
Is there a clean way to do what I'm trying to do here?
If not, what would be the best solution in this case?
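One possible direction, assuming the cause is that rJava references such as 'rows' are external pointers that arrive as NULL on the workers (which would match the "method of a NULL object" error), is to let each worker open the workbook itself instead of exporting 'rows'. A rough, untested sketch; make_df would still need its df handling reworked, as noted in the question:
library(parallel)

cl <- makeCluster(detectCores() - 1)

# each worker loads the file and builds its own 'rows' object
invisible(clusterEvalQ(cl, {
  options(java.parameters = "-Xmx4096m") # must be set before the JVM starts
  library(xlsx)
  wb <- loadWorkbook(file = "my_excel.xlsx")
  rows <- getRows(getSheets(wb)[[1]])
  NULL # avoid shipping Java objects back to the master
}))

n <- clusterEvalQ(cl, length(rows))[[1]] # row count, taken from one worker
df <- parSapply(cl, 1:n, make_df)        # 'rows' resolves on each worker
stopCluster(cl)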

R data tables accessing columns by name

If I have a data table, foo, in R with a column named "date", I can get the vector of date values by the notation
foo[, date]
(Unlike data frames, date doesn't need to be in quotes).
How can this be done programmatically? That is, if I have a variable x whose value is the string "date", then how do I access the column of foo with that name?
Something that sort of works is to create a symbol:
sym <- as.name(x)
v <- foo[, eval(sym)]
...
As I say, that sort of works, but there is something not quite right about it. If that code is inside a function myFun in package myPackage, then it seems it doesn't work when I explicitly qualify the call with the package name:
myPackage::myFun(...)
I get an error message saying "undefined columns selected".
[edited] Some more details
Suppose I create a package called myPackage. This package has a single file with the following in it:
library(data.table)
#' export
myFun <- function(table1) {
  names1 <- names(table1)
  name1 <- names1[[1]]
  sym <- as.Name(name1)
  table1[, eval(sym)]
}
If I load that function using R Studio, then
myFun(tbl)
returns the first column of the data table tbl.
On the other hand, if I call
myPackage::myFun(tbl)
it doesn't work. It complains about
Error in .subset(x, j) : invalid subscript type 'builtin'
I'm just curious as to why myPackage:: would make this difference.
A quick way which points to a longer way is this:
subset(foo, TRUE, date)
The subset function accepts unquoted symbols/names for its 'subset' and 'select' arguments. (Its author, however, thinks this was a bad idea and suggests we use formulas instead.) This was the jumping-off place for sections of Hadley Wickham's Advanced R pages (and book): http://adv-r.had.co.nz/Computing-on-the-language.html and http://adv-r.had.co.nz/Functional-programming.html. You can also look at the code for subset.data.frame:
> subset.data.frame
function (x, subset, select, drop = FALSE, ...)
{
    r <- if (missing(subset))
        rep_len(TRUE, nrow(x))
    else {
        e <- substitute(subset)
        r <- eval(e, x, parent.frame())
        if (!is.logical(r))
            stop("'subset' must be logical")
        r & !is.na(r)
    }
    vars <- if (missing(select))
        TRUE
    else {
        nl <- as.list(seq_along(x))
        names(nl) <- names(x)
        eval(substitute(select), nl, parent.frame())
    }
    x[r, vars, drop = drop]
}
The problem with the use of "naked" expressions that get passed into functions is that their evaluation frame is sometimes not what is expected. R formulas, like other functions, carry a pointer to the environment in which they were defined.
I think the problem is that you've defined myFun in your global environment, so it only appeared to work.
I changed as.Name to as.name, and created a package with the following functions:
library(data.table)
myFun <- function(table1) {
  names1 <- names(table1)
  name1 <- names1[[1]]
  sym <- as.name(name1)
  table1[, eval(sym)]
}
myFun_mod <- function(dt) {
  # dt[, eval(as.name(colnames(dt)[1]))]
  dt[[colnames(dt)[1]]]
}
Then, I tested it using this:
library(data.table)
myDt <- data.table(a = letters[1:3], b = 1:3)
myFun(myDt)
myFun_mod(myDt)
myFun didn't work
myFun_mod did work
The output:
> library(test)
> myFun(myDt)
Error in eval(expr, envir, enclos) : object 'a' not found
> myFun_mod(myDt)
[1] "a" "b" "c"
Then I added the following line to the NAMESPACE file:
import(data.table)
This is what @mnel was talking about with this link:
Using data.table package inside my own package
After adding import(data.table), both functions work.
I'm still not sure why you got the particular .subset error, which is why I went through the effort of reproducing the result...
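As a side note (not part of the original answers), data.table also supports more direct programmatic access by a character name, which avoids the eval(as.name(...)) dance entirely. A sketch, where the ..x form assumes a reasonably recent data.table:
library(data.table)
foo <- data.table(date = Sys.Date() + 0:2, y = 1:3)
x <- "date"

foo[[x]]                # list-style extraction, as with a data.frame
foo[, get(x)]           # evaluate the name inside the data.table's scope
foo[, x, with = FALSE]  # character selection; returns a one-column data.table
foo[, ..x]              # '..' prefix: look x up in the calling environment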

tryCatch with a complicated function and plyr in R

I've got a complicated, long function that I'm using to run simulations. It can generate errors, mostly from random vectors ending up with equal values and zero variance and then getting fed into PCAs or logistic regressions.
I'm executing it on a cluster using doMC and plyr. I don't want to tryCatch every little thing inside of the function, because the possibilities for errors are many and the probabilities of each of them are small.
How do I tryCatch each run, rather than tryCatch-ing every little line? The code is something like this:
iteration = function() {
  # a really long simulation function where errors can happen
}
reps = 10000
results = llply(1:reps, function(idx) { out <- iteration() }, .parallel = TRUE)
EDIT, about a year later:
The foreach package makes this substantially easier than plyr does.
library(foreach)
output <- foreach(i = 1:reps, .errorhandling = 'remove') %dopar% {
  # the body of the simulation goes here
}
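A runnable version of that skeleton might look like the following (a sketch assuming a doMC backend, as used in the question; the failing body is invented for illustration):
library(foreach)
library(doMC)
registerDoMC(2)

iteration <- function(i) {
  if (i %% 3 == 0) stop("simulated failure") # hypothetical error for the demo
  i^2
}

# iterations that throw are simply dropped from the result list
output <- foreach(i = 1:10, .errorhandling = "remove") %dopar% iteration(i)
length(output) # 7: the three failing iterations were removed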
Can you wrap the tryCatch/try inside the function you pass to llply?
results = llply(1:reps, function(idx) {
  out = NA
  try({
    out <- iteration()
  }, silent = TRUE)
  out
}, .parallel = TRUE)
You can put tryCatch within your function iteration. For example:
iteration <- function(idx) {
  tryCatch({
    idx <- idx + 1
    ## very long treatments here...
    ## I add a dummy error here to test my tryCatch
    if (idx %% 2000 == 0) stop("too many iterations")
  }, error = function(e) print(paste('error', idx)))
}
Now, testing it within llply:
library(plyr)
reps = 10000
results = llply(1:reps, iteration, .parallel = TRUE)
1] "error 2000"
[1] "error 4000"
[1] "error 6000"
[1] "error 8000"
[1] "error 10000"
