Race condition with the parallel package in R

I am trying to run a function with a side effect in parallel. For example, in the following snippet, add.entry is meant to modify master as a side effect.
library(parallel)
master <- data.frame()
add.entry <- function(x) {
row <- data.frame(a = x, b = sin(x))
master <- rbind(master, row)
}
mclapply(1:42, add.entry)
The output I get is
[[1]]
  a        b
1 1 0.841471
[[2]]
  a         b
1 2 0.9092974
[[3]]
  a       b
1 3 0.14112
[[4]]
  a          b
1 4 -0.7568025
However, master contains nothing afterwards. My best guess is that there is some race condition involved. How can I fix it, like maybe declaring a critical section?

It is very slow to grow an object inside a loop (cf. https://privefl.github.io/blog/why-loops-are-slow-in-r/).
When you use parallelism, you don't rbind() to the master in your global environment, but to copies of it in the forked worker processes, so the original is never changed (cf. https://privefl.github.io/blog/a-guide-to-parallelism-in-r/).
mclapply already returns its results as a list (like lapply), so there is no need for a side effect.
You can simply do
library(parallel)
add.entry <- function(x) {
data.frame(a = x, b = sin(x))
}
res_list <- mclapply(1:42, add.entry)
master <- do.call("rbind", res_list)
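To see why master stays empty even without any parallelism, here is a minimal sketch using plain lapply (an assumption made so that it runs on any platform; the scoping behaves the same way under mclapply). The assignment inside add.entry creates a local variable named master, so the global one is never touched. Collecting the returned values is what fills it.

```r
master <- data.frame()

# The question's version: `master <-` inside the function assigns to a
# local variable, which is discarded when the function returns.
add.entry <- function(x) {
  master <- rbind(master, data.frame(a = x, b = sin(x)))
}
invisible(lapply(1:3, add.entry))
n_before <- nrow(master)   # the global master still has no rows

# Returning each row and binding once at the end fills master.
res_list <- lapply(1:3, function(x) data.frame(a = x, b = sin(x)))
master <- do.call(rbind, res_list)
n_after <- nrow(master)
```

So no critical section is needed: each call returns its own row, and the rows are combined once in the parent process.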

Related

How to transform the object of a function in r?

I want to create a function that transforms its object.
I have tried to transform the variable as you would normally, but within the function.
This works:
vec <- c(1, 2, 3, 3)
vec <- (-1*vec)+1+max(vec, na.rm = T)
[1] 3 2 1 1
This doesn't work:
vec <- c(1, 2, 3, 3)
func <- function(x){
x <- (-1*x)+1+max(x, na.rm = T)
}
func(vec)
vec
[1] 1 2 3 3
R is functional so normally one returns the output. If you want to change
the value of the input variable to take on the output value then it is normally done by the caller, not within the function. Using func from the question it would normally be done like this:
vec <- func(vec)
Furthermore, while you can overwrite variables it is, in general, not a good
idea. It makes debugging difficult. Is the current value of vec the
input or output and if it is the output what is the value of the input? We
don't know since we have overwritten it.
func_overwrite
That said, if you really want to do this despite the comments above, then:
# works but not recommended
func_overwrite <- function(x) eval.parent(substitute({
x <- (-1*x)+1+max(x, na.rm = TRUE)
}))
# test
v <- c(1, 2, 3, 3)
func_overwrite(v)
v
## [1] 3 2 1 1
Replacement functions
Despite R's functional nature, it does provide one facility for overwriting: replacement functions. The function in the question is not really a good candidate for one, so let us change the example to a function incr that increments the input variable by a given value. That is, it does this:
x <- x + b
We can write this in R as:
`incr<-` <- function(x, value) x + value
# test
xx <- 3
incr(xx) <- 10
xx
## [1] 13
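As a rough illustration of what happens behind the scenes, R treats a replacement call as an ordinary call to the `incr<-` function followed by an assignment in the caller. The sketch below repeats the definition so it runs on its own; yy is a hypothetical second variable added only for the comparison.

```r
`incr<-` <- function(x, value) x + value

# Replacement syntax
xx <- 3
incr(xx) <- 10

# Roughly what R evaluates for the line above
yy <- 3
yy <- `incr<-`(yy, value = 10)

same <- identical(xx, yy)
```

The last argument of a replacement function has to be named value, which is why the definition uses that name.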
T vs. TRUE
One other comment: do not use T for TRUE. Always write it out. TRUE is a reserved word in R, but T is an ordinary variable name that can be reassigned, so it can lead to hard-to-find errors, such as when someone uses T for temperature.
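A small sketch of how this can go wrong, assuming someone has stored a temperature of 0 in T earlier in the session (the variable names bad and good are only for illustration):

```r
vec <- c(1, NA, 3)

T <- 0                           # legal: T is just a variable (say, a temperature)
bad  <- mean(vec, na.rm = T)     # na.rm is now 0, which acts as FALSE, so the NA is kept
good <- mean(vec, na.rm = TRUE)  # TRUE cannot be reassigned, so this is safe

rm(T)                            # remove the shadowing variable again
```

Here bad is NA while good is the mean of the non-missing values, and nothing in the first call warns about the difference.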

Problems with for loops inside a function in R

I have a problem with a function of the following kind:
fun.name <- function(x,y) {
a<-x
b<-y
for (i in c(a, b)){
i<-i+1
print (i)
}
print(a)
print(b)
}
fun.name(1, 2)
The result is
[1] 2
[1] 3
[1] 1
[1] 2
The same result is obtained if I do not create any a and b and I simply keep x and y (fun.name <- function(x,y) { for (i in c(x, y))...).
I cannot understand this behavior.
What I wanted was a function which adds one to every argument and prints the results. Why doesn't the loop modify the variables a and b when it is defined within the function? I guess it is a problem of environments, and that I have not understood the nature of a function's arguments.
Thank you for any suggestions.
I actually expect to see your current output. Here is your code, formatted, with explanations as comments:
fun.name <- function(x,y) {
a <- x
b <- y
for (i in c(a, b)) { # i takes the values 1, then 2 (copies of a and b)
# first iteration: i becomes 2, prints 2
# second iteration: i becomes 3, prints 3
i <- i+1
print(i)
}
print(a) # prints 1 (a was only assigned once)
print(b) # prints 2 (same reason as above)
}
fun.name(1, 2)
There are no changes to a and b after their initial assignments inside the function. But, even if there were changes, the variables a and b would not even be visible outside the scope of the function.
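For what the question was after, a function that adds one to every argument, a minimal sketch could return the incremented values instead of trying to modify variables through the loop index. The name add_one is hypothetical. Note that i in a for loop holds a copy of each element, so assigning to i never changes a or b.

```r
add_one <- function(...) {
  args <- c(...)            # collect all arguments into one vector
  incremented <- args + 1   # arithmetic is vectorised, so no loop is needed
  print(incremented)
  invisible(incremented)    # return the result so the caller can store it
}

res <- add_one(1, 2)
```

The caller then decides what to do with the result, for example by assigning it back to its own variables.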

Nonstandard evaluation of list of variable names

I'm writing an estimation procedure in R that loops through a list of variable names from a data.frame that the user declares. I'm trying to avoid requiring the user to enquote the variables to make their life easier (the goal is to upload this to CRAN, so we care a lot about user experience).
To prevent R from trying to evaluate the variable names, I constructed the function alt() that is like an alternative to c() and list(), but does not evaluate the elements.
My question is how I can elegantly do away with the alt() function, so users can learn one less function. Here is a simple MWE that hopefully illustrates the problem:
## Construct non-evaluating list function
alt <- function(...) {
alt <- as.list(substitute(list(...)))
return(alt[-1])
}
## Construct function that enquotes non-evaluated vectors
## contained in 'alt()'. Perhaps enquoting variable names
## is unavoidable because the data set is stored as a
## data.frame, but at least the user will not have to do it.
restring <- function(vector) {
vector <- deparse(vector)
if (substr(vector, start = 1, stop = 2) == "c(") {
vector <- substr(vector, 3, nchar(vector) - 1)
vector <- strsplit(vector, ", ")[[1]]
}
return(vector)
}
## Example of a function that loops over the list above
## for a given data set. The function simply prints out
## the columns declared in each element of 'alt()'.
test <- function(data, vlist) {
for (i in 1:length(vlist)) {
print(paste0("Data set ", i, ":"))
print(data[, restring(vlist[[i]])])
}
}
## Construct example data
N <- 4
df <- data.frame(x1 = c(1, 2),
x2 = c(3, 4))
## Example of user-declared list of variables to loop over
vlist <- alt(x1, c(x1, x2))
## Output from running this example
> test(df, vlist)
[1] "Data set 1:"
[1] 1 2
[1] "Data set 2:"
x1 x2
1 1 3
2 2 4
The user could also have declared
test(df, alt(x1, c(x1, x2)))
But it would be nice if I did not have to require the user to use a different function to declare these lists of variables. If it could work using standard R functions, like
test(df, list(x1, c(x1, x2)))
that would be great, but I haven't been able to find a way other than performing some ungainly string manipulations using deparse(substitute()), similar to the restring() function (not sure how CRAN feels about that).
Any thoughts on this non-standard evaluation issue would be appreciated. Also, if alt() is easy enough to use that it is not worth removing, that would also be good to know.
A more compact option would be enexprs from rlang
library(rlang)
alt1 <- function(...) enexprs(...)
test(df, alt1(x1, c(x1, x2)))
#[1] "Data set 1:"
#[1] 1 2
#[1] "Data set 2:"
# x1 x2
#1 1 3
#2 2 4
Or without using any external package, quote the expressions in a list
test(df, list(quote(x1), quote(c(x1, x2))))
#[1] "Data set 1:"
#[1] 1 2
#[1] "Data set 2:"
# x1 x2
#1 1 3
#2 2 4
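If the aim is to let users write list(x1, c(x1, x2)) directly, one possible base R sketch is to capture the argument unevaluated inside the function with substitute() and pull the variable names out with all.vars(). The function name test2 is hypothetical, and this assumes the user always passes a literal list(...) call rather than a variable holding a list.

```r
test2 <- function(data, vlist) {
  # Capture the argument as written, e.g. list(x1, c(x1, x2)),
  # and drop the leading `list` symbol to keep only the elements.
  exprs <- as.list(substitute(vlist))[-1]
  # all.vars() returns the variable names used in each expression.
  lapply(exprs, function(e) data[, all.vars(e), drop = FALSE])
}

df <- data.frame(x1 = c(1, 2), x2 = c(3, 4))
out <- test2(df, list(x1, c(x1, x2)))
```

This avoids the string manipulation in restring(), at the cost of breaking when vlist is built elsewhere and passed in by name.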

R assigning function call to two different cores

So far, all I've read about parallel processing in R involves looking at multiple rows of one dataframe.
But what if I have two or three large dataframes that I want to perform a long function on? Can I assign each instance of the function to a specific core so I don't have to wait for it to work sequentially? I'm on Windows.
Let's say this is the function:
AltAlleleRecounter <- function(names,data){
data$AC <- 0
numalleles <- numeric(length=nrow(data))
for(i in names){
genotype <- str_extract(data[,i],"^[^/]/[^/]")
GT <- dstrfw(genotype,c('character','character','character'),c(1L,1L,1L))
data[GT$V1!='.',]$AC <- data[GT$V1!='.',]$AC+GT[GT$V1!='.',]$V1+GT[GT$V1!='.',]$V3
numalleles[GT$V1!='.'] <- numalleles[GT$V1!='.'] + 2
}
data$AF <- data$AC/numalleles
return(data)
}
What I want to do is basically this (generic pseudocode):
wait_till_everything_is_finished(
core1="data1 <- AltAlleleRecounter(sampleset1,data1,1)",
core2="data2 <- AltAlleleRecounter(sampleset2,data2,2)",
core3="data3 <- AltAlleleRecounter(sampleset3,data3,3)"
)
where all three commands are running but the program doesn't progress until everything is done.
Edit:
Bryan's suggestion worked. I replaced "otherList" with my second list. This is example code:
myframelist <- list(data1,data2)
mynameslist <- list(names1,names2)
myframelist <- foreach(i=1:2) %dopar% (AltAlleleRecounter(mynameslist[[i]],myframelist[[i]]))
myfilenamelist <- list("data1.tsv","data2.tsv")
foreach(i=1:2) %dopar% (write.table(myframelist[[i]], file=myfilenamelist[[i]], quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE))
The data variables are dataframes and the name variables are just character vectors. You may need to reload some packages.
Try something like this:
library(doParallel)
library(foreach)
cl<-makeCluster(6) ## you can set up as many cores as you need/want/have here.
registerDoParallel(cl)
getDoParWorkers() # should be the number you registered. If not, something went wrong.
df1<-data.frame(matrix(1:9, ncol = 3))
df2<-data.frame(matrix(1:9, ncol = 3))
df3<-data.frame(matrix(1:9, ncol = 3))
mylist<-list(df1, df2, df3)
otherList<-list(1, 2, 3)
mylist<-foreach(i=1:3) %dopar% (mylist[[i]] * otherList[[i]])
mylist
[[1]]
X1 X2 X3
1 4 7
2 5 8
3 6 9
[[2]]
X1 X2 X3
2 8 14
4 10 16
6 12 18
[[3]]
X1 X2 X3
3 12 21
6 15 24
9 18 27
I do this fairly often with topic modeling different databases. The idea is to create lists of the data you want to apply your function to, then have foreach apply your function to those indexed lists in parallel. For your example you will have to make a list of your data.frames and another list of your samplesets.
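The same idea of pairing a list of data frames with a list of arguments can also be sketched with only the parallel package that ships with R, using clusterMap. This is an alternative to the foreach approach, not a replacement for it. The function scale_columns and the example data are hypothetical stand-ins for a long-running function and real data.

```r
library(parallel)

# Hypothetical stand-in for a long-running function of (names, data)
scale_columns <- function(cols, data) {
  data[cols] <- data[cols] * 10
  data
}

frames <- list(data.frame(a = 1:3, b = 4:6),
               data.frame(a = 7:9, b = 10:12))
cols <- list("a", c("a", "b"))

cl <- makeCluster(2)   # two worker processes; this kind of cluster also works on Windows
results <- tryCatch(
  clusterMap(cl, scale_columns, cols, frames),  # blocks until every call has finished
  finally = stopCluster(cl)                     # always release the workers
)
```

Each worker is a fresh R session, so any packages the function needs have to be loaded there as well, for example with clusterEvalQ(cl, library(...)) here or the .packages argument of foreach.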

Recalling an expression that was created via a sample

When I want to view structures as they were called, I can usually do it with enquote.
For an arbitrary list d this would be
> d <- list(a = 1, b = 2)
> enquote(d)
# quote(list(a = 1, b = 2))
But for an object created via a sample, it's different. sample does not show up in the quoted call.
> m <- matrix(sample(2))
> enquote(m)
# quote(c(2L, 1L))
Is there a way to show the call/expression that created m, so that sample shows up? So that the result would be something like
quote(matrix(sample(2)))
Update: Simon's answer below is great, but I'd really like to see if I can get an answer that doesn't require I use substitute to create the matrix m.
I'm not 100% sure if this serves your purpose, but you could try defining an expression with substitute before evaluating it to create m (no quote though...):
xpr <- substitute(matrix(sample(2)))
m <- eval(xpr)
Result:
> m
[,1]
[1,] 2
[2,] 1
> xpr
matrix(sample(2))
Cheers!
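An R object does not remember the expression that created it, so the call has to be recorded at creation time one way or another. One possible sketch is a small helper that captures the call and stores it as an attribute on the result. The name with_call and the attribute name "call" are hypothetical, and the helper still uses substitute internally, although the caller does not have to.

```r
with_call <- function(expr) {
  cl <- substitute(expr)                   # the expression as the caller wrote it
  value <- eval(cl, parent.frame())        # evaluate it in the caller's environment
  structure(value, call = cl)              # attach the call to the result
}

m <- with_call(matrix(sample(2)))
created_by <- attr(m, "call")              # matrix(sample(2))
```

The attribute travels with m until an operation drops attributes, so this is a convenience rather than a guarantee.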
