I have a function f(x, y) that returns a list of 8 logical vectors, where x and y are integers. I want to populate a three-dimensional array M so that M[x, y, z] is the number of TRUEs in the zth element of f(x, y). I can do this with nested for loops, but I know those are frowned upon in R. I think there's a more elegant way, using either outer or rbind and sapply but I can't figure it out. Here's my code with the nested for loops:
M <- array(dim=c(150, 200, 8))
for(j in 1:150) {
  for(k in 1:200) {
    rsu <- f(j, k)
    for(z in 1:8) {
      M[j, k, z] <- sum(rsu[[z]])
    }
  }
}
What is a more efficient/elegant way of populating this array that gives the same result?
Edited to add: For purposes of this question, treat f as a black box. In reality it involves various calculations and lookups about eight different satellites, but here's a dummy function that will generate some data for this example:
is.prime <- function(n) n == 2L || all(n %% 2L:ceiling(sqrt(n)) != 0)
#source for is.prime function:
# https://stackoverflow.com/questions/19767408/prime-number-function-in-r
f <- function(x,y) {
retlist <- list()
retlist[[1]] <- c(FALSE, FALSE, rep(TRUE, x))
retlist[[2]] <- c(TRUE, TRUE, rep(FALSE, y), rep(TRUE, y))
retlist[[3]] <- c(is.prime(x), is.prime(y), is.prime(x+y), is.prime(x+y+3), sapply(x:(2*(x+y)), is.prime))
retlist[[4]] <- c(x+y %% 5 == 0, x*y %% 6 ==0)
retlist[[5]] <- retlist[[(x+y) %% 4 + 1]]
retlist[[6]] <- retlist[[y %% 4 + 1]]
retlist[[7]] <- retlist[[x %% 6 + 1]]
retlist[[8]] <- sapply(abs(x-y):(7L*x+y+1), is.prime)
return(retlist)
}
Here's how to populate the array, giving the same results, using nested functions and sapply instead of for:
f2 <- function(x,y) {
rsu <- f(x,y)
values <- vapply(1:8, FUN=function(z) sum(rsu[[z]]), FUN.VALUE=1L)
}
f3 <- function(x) array(data=t(sapply(1:200, FUN=function(w) f2(x,w))), dim=c(1,200,8))
M2 <- array(data=t(sapply(1:150, FUN=f3)), dim=c(150,200,8))
Here's how to do it with outer. But it's unintuitive; the matrix data are assigned within the function. I don't understand why I need to invoke Vectorize(f2) here instead of simply f2.
M2 <- array(dim=c(150, 200, 8))
f2 <- function(x, y) {
rsu <- f(x, y)
M2[x, y, ] <<- vapply(1:8, FUN=function(z) sum(rsu[[z]]), FUN.VALUE=1L)
return(0L)
}
ABC <- outer(1:150, 1:200, Vectorize(f2))
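On the Vectorize question: as far as I understand outer, it does not call FUN once per (x, y) pair. It expands the two inputs into two equally long vectors and calls FUN a single time, expecting one result per element. f2 only handles scalar x and y, so Vectorize is what turns it into a per-pair loop. Below is a minimal sketch of that behaviour, followed by one possible way to fill the array without <<-. Note that probe and f_toy are hypothetical stand-ins for illustration, not the real f:

```r
# outer() calls FUN a single time, passing two long vectors of equal length
calls <- 0L
probe <- function(x, y) {
  calls <<- calls + 1L
  x + y                      # one value per (x, y) pair, as outer() expects
}
res <- outer(1:3, 1:4, probe)
calls                        # 1: a single call with vectors of length 12

# Hypothetical toy stand-in for f: returns a list of 2 logical vectors
f_toy <- function(x, y) list(rep(TRUE, x), c(rep(TRUE, y), FALSE))

# Fill the array without <<-: one column of counts per (x, y) pair
nx <- 3; ny <- 4; nz <- 2
grid <- expand.grid(x = seq_len(nx), y = seq_len(ny))   # x varies fastest
counts <- mapply(function(x, y) vapply(f_toy(x, y), sum, integer(1)),
                 grid$x, grid$y)                        # nz x (nx * ny) matrix
M_toy <- array(t(counts), dim = c(nx, ny, nz))          # M_toy[x, y, z]
```

The transpose matters: after t(), the pairs run down the rows with x varying fastest, which is the order array() needs to fill the first two dimensions.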
I have a numeric data.frame df with 134946 rows x 1938 columns.
99.82% of the data are NA.
For each pair of (distinct) columns "P1" and "P2", I need to find which rows have non-NA values for both and then do some operations on those rows (linear model).
I wrote a script that does this, but it seems quite slow.
This post seems to discuss a related task, but I can't immediately see if or how it can be adapted to my case.
Borrowing the example from that post:
set.seed(54321)
nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow=nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.9)] = NA
df <- as.data.frame(dat)
df_ps <- names(df)
N_ps <- length(df_ps)
My script is:
tic = proc.time()
out <- do.call(rbind,sapply(1:(N_ps-1), function(i) {
if (i/10 == floor(i/10)) {
cat("\ni = ",i,"\n")
toc = proc.time();
show(toc-tic);
}
do.call(rbind,sapply((i+1):N_ps, function(j) {
w <- which(complete.cases(df[,i],df[,j]))
N <- length(w)
if (N >= 5) {
xw <- df[w,i]
yw <- df[w,j]
if ((diff(range(xw)) != 0) & (diff(range(yw)) != 0)) {
s <- summary(lm(yw~xw))
o <- c(i,j,N,s$adj.r.squared,s$coefficients[2],s$coefficients[4],s$coefficients[8],s$coefficients[1],s$coefficients[3],s$coefficients[7])} else {
o <- c(i,j,N,rep(NA,7))
}
} else {o <- NULL}
return(o)
},simplify=F))
}
,simplify=F))
toc = proc.time();
show(toc-tic);
This takes about 10 minutes on my machine.
You can imagine what happens when I need to handle a much larger (although more sparse) data matrix. I never managed to finish the calculation.
Question: do you think this could be done more efficiently?
The thing is, I don't know which operations take the most time: subsetting df (in which case I would avoid repeating the same subsetting), appending matrix data (in which case I would fill a flat vector and convert it to a matrix at the end), or something else.
Thanks!
EDIT following up from minem's post
As shown by minem, the speed of this calculation strongly depended on the way linear regression parameters were calculated. Therefore changing that part was the single most important thing to do.
My own further trials showed two things. First, it was essential to use sapply in combination with do.call(rbind, ...), rather than any flat vector, to store the data (I am still not sure why; I might make a separate post about this). Second, on the original matrix I am working on, which is much more sparse and has a much larger nrows/ncolumns ratio than the one in this example, using the information on the x vector available at the start of each i iteration to reduce the y vector at the start of each j iteration increased the speed by several orders of magnitude, even compared with minem's original script, which was already much better than mine above.
I suppose the advantage comes from filtering out many rows a priori, thus avoiding costly xna & yna operations on very long vectors.
The modified script is the following:
set.seed(54321)
nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow = nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.90)] = NA
df <- as.data.frame(dat)
df_ps <- names(df)
N_ps <- length(df_ps)
tic = proc.time()
naIds <- lapply(df, function(x) !is.na(x))
dl <- as.list(df)
rl <- sapply(1:(N_ps - 1), function(i) {
if ((i-1)/10 == floor((i-1)/10)) {
cat("\ni = ",i,"\n")
toc = proc.time();
show(toc-tic);
}
x <- dl[[i]]
xna <- which(naIds[[i]])
rl2 <- sapply((i + 1):N_ps, function(j) {
y <- dl[[j]][xna]
yna <- which(naIds[[j]][xna])
w <- xna[yna]
N <- length(w)
if (N >= 5) {
xw <- x[w]
yw <- y[yna]
if ((min(xw) != max(xw)) && (min(yw) != max(yw))) {
# extracts from lm/lm.fit/summary.lm functions
X <- cbind(1L, xw)
m <- .lm.fit(X, yw)
# calculate adj.r.squared
fitted <- yw - m$residuals
rss <- sum(m$residuals^2)
mss <- sum((fitted - mean(fitted))^2)
n <- length(m$residuals)
rdf <- n - m$rank
# rdf <- df.residual
r.squared <- mss/(mss + rss)
adj.r.squared <- 1 - (1 - r.squared) * ((n - 1L)/rdf)
# calculate se & pvals
p1 <- 1L:m$rank
Qr <- m$qr
R <- chol2inv(Qr[p1, p1, drop = FALSE])
resvar <- rss/rdf
se <- sqrt(diag(R) * resvar)
est <- m$coefficients[m$pivot[p1]]
tval <- est/se
pvals <- 2 * pt(abs(tval), rdf, lower.tail = FALSE)
res <- c(m$coefficients[2], se[2], pvals[2],
m$coefficients[1], se[1], pvals[1])
o <- c(i, j, N, adj.r.squared, res)
} else {
o <- c(i,j,N,rep(NA,7))
}
} else {o <- NULL}
return(o)
}, simplify = F)
do.call(rbind, rl2)
}, simplify = F)
out2 <- do.call(rbind, rl)
toc = proc.time();
show(toc - tic)
E.g. try with nr=100000; nc=100.
I should probably mention that I tried using indices, i.e.:
naIds <- lapply(df, function(x) which(!is.na(x)))
and then obviously generating w by intersection:
w <- intersect(xna,yna)
N <- length(w)
This however is slower than the above.
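A related idea, offered only as a sketch and not as part of either script: the number of jointly non-NA rows for every column pair can be computed in one matrix product, so pairs with fewer than 5 shared rows can be discarded before any per-pair work is done. The small matrix m and the names ok, cnt and pairs are made up for illustration.

```r
set.seed(1)
m <- matrix(runif(200), nrow = 20)       # small stand-in for the real data
m[sample(length(m), 120)] <- NA

ok  <- !is.na(m)                         # TRUE where a value is present
cnt <- crossprod(ok + 0)                 # cnt[i, j] = rows non-NA in both i and j

# Keep only distinct pairs (i < j) that share at least 5 usable rows
pairs <- which(cnt >= 5 & upper.tri(cnt), arr.ind = TRUE)
head(pairs)
```

Each row of pairs is then an (i, j) combination worth fitting. Whether this pays off presumably depends on how many pairs fall below the threshold and on whether the dense count matrix fits in memory.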
The largest bottleneck is the lm function, because it performs a lot of checks and additional calculations that you do not necessarily need. So I extracted only the needed parts.
I got this to run in about 18 seconds.
set.seed(54321)
nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow = nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.9)] = NA
df <- as.data.frame(dat)
df_ps <- names(df)
N_ps <- length(df_ps)
tic = proc.time()
naIds <- lapply(df, function(x) !is.na(x)) # computed once, outside the loops
dl <- as.list(df) # subsetting list elements is faster than subsetting columns
rl <- sapply(1:(N_ps - 1), function(i) {
x <- dl[[i]]
xna <- naIds[[i]] # logical vector marking the non-NA elements of x
rl2 <- sapply((i + 1):N_ps, function(j) {
y <- dl[[j]]
yna <- naIds[[j]]
w <- xna & yna
N <- sum(w)
if (N >= 5) {
xw <- x[w]
yw <- y[w]
if ((min(xw) != max(xw)) && (min(yw) != max(yw))) { # faster than diff(range())
# extracts from lm/lm.fit/summary.lm functions
X <- cbind(1L, xw)
m <- .lm.fit(X, yw)
# calculate adj.r.squared
fitted <- yw - m$residuals
rss <- sum(m$residuals^2)
mss <- sum((fitted - mean(fitted))^2)
n <- length(m$residuals)
rdf <- n - m$rank
# rdf <- df.residual
r.squared <- mss/(mss + rss)
adj.r.squared <- 1 - (1 - r.squared) * ((n - 1L)/rdf)
# calculate se & pvals
p1 <- 1L:m$rank
Qr <- m$qr
R <- chol2inv(Qr[p1, p1, drop = FALSE])
resvar <- rss/rdf
se <- sqrt(diag(R) * resvar)
est <- m$coefficients[m$pivot[p1]]
tval <- est/se
pvals <- 2 * pt(abs(tval), rdf, lower.tail = FALSE)
res <- c(m$coefficients[2], se[2], pvals[2],
m$coefficients[1], se[1], pvals[1])
o <- c(i, j, N, adj.r.squared, res)
} else {
o <- c(i,j,N,rep(NA,7)) # 10 values, same length as the fitted case
}
} else {o <- NULL}
return(o)
}, simplify = F)
do.call(rbind, rl2)
}, simplify = F)
out2 <- do.call(rbind, rl)
toc = proc.time();
show(toc - tic);
# user system elapsed
# 17.94 0.11 18.44
I'm using this code:
library("partitions")
x <- c(2,4,6)
parts <- listParts(length(x))
out <- rapply(parts, function(ii) x[ii], how="replace")
to compute the list of all partitions of a vector, but I would like only the partitions with k blocks, for example:
k=2
{(2),(4,6)}{(4),(2,6)}{(6),(2,4)}
Maybe there are better ways of doing this but the following does what you want.
library(partitions)
funParts <- function(x, k){
parts <- listParts(length(x))
res <- lapply(parts, function(inx) sapply(inx, function(i) x[i]))
res <- unlist(res, recursive = FALSE)
res <- res[sapply(res, length) <= k]
unique(res)
}
x <- c(2,4,6)
k <- 2
funParts(x, 2)
funParts(x, 1)
funParts(4:10, 3)
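If the goal is the partitions of x into exactly k non-empty blocks, as in the k=2 example in the question, the following base R sketch is one way to enumerate them without any package. The function name parts_k is hypothetical. It uses the usual recursion: the first element either forms a block of its own or joins one block of a partition of the remaining elements.

```r
# Hypothetical helper: all partitions of x into exactly k non-empty blocks
parts_k <- function(x, k) {
  n <- length(x)
  if (k < 1 || k > n) return(list())
  if (k == 1) return(list(list(x)))        # everything in one block
  if (k == n) return(list(as.list(x)))     # every element on its own
  first <- x[1]
  rest  <- x[-1]
  out <- list()
  # Case 1: the first element forms a block by itself
  for (p in parts_k(rest, k - 1)) {
    out[[length(out) + 1]] <- c(list(first), p)
  }
  # Case 2: the first element joins one of the k blocks of a partition of rest
  for (p in parts_k(rest, k)) {
    for (b in seq_along(p)) {
      q <- p
      q[[b]] <- c(first, q[[b]])
      out[[length(out) + 1]] <- q
    }
  }
  out
}

res <- parts_k(c(2, 4, 6), 2)
length(res)   # 3 partitions: {(2),(4,6)}, {(2,4),(6)}, {(4),(2,6)}
```

The number of results grows quickly with length(x), so this is only meant for small inputs.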
I want to use mapply to apply fun to all elements of a matrix. I'm not sure how to use mapply when the function to be applied makes use of each element's location in the matrix.
fun <- function(theta, mat, i, j){
sum_nearby <- function(mat,i,j,dist){
if (j - dist < 1) mat[i, j + dist]
else if (j + dist > ncol(mat)) mat[i, j - dist]
else mat[i, j - dist] + mat[i, j + dist]
}
g0 <- -2*mat[i,j]
g1 <- g0*sum_nearby(mat,i,j,1)
-log1p(exp(theta %*% c(g0, g1)))
}
Try mapply over the row and column indices like this where fun is the function defined in the question. The result is a numeric vector v:
# test inputs
theta <- 1:2
mat <- as.matrix(BOD)
v <- mapply(fun, row(mat), col(mat), MoreArgs = list(theta = theta, mat = mat))
Then it can be summed like this sum(v) or reshaped into a matrix with the same dimensions as mat like this: replace(mat, TRUE, v) or array(v, dim(mat)) or matrix(v, nrow(mat)) or 0*mat+v
Note: Alternatives would be to use outer returning a matrix having the same dimensions as mat:
outer(1:nrow(mat), 1:ncol(mat), Vectorize(function(i, j) fun(theta, mat, i, j)))
or apply returning a vector as in mapply solution above:
apply(cbind(c(row(mat)), c(col(mat))), 1, function(ix) fun(theta, mat, ix[1], ix[2]))
Suppose I have a vector x which I want to convolve with itself n times. What is a good way to do this in R?
Suppose that we already have a function conv(u,v) that convolves two vectors.
I can do this:
autoconv<-function(x,n){
r<-1;
for(i in 1:n){
r<-conv(r,x);
}
return(r);
}
Is there a more efficient way?
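Take the fast Fourier transform (fft) of x, raise it to the kth power and take the inverse fft. The code below then compares that with repeated convolution of k copies of x using convolve, whose default type is circular, so both methods compute a circular convolution. No packages are used.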
# set up test data
set.seed(123)
k <- 3 # no of vectors to convolve
n <- 32 # length of x
x <- rnorm(n)
# method 1 using fft and inverse fft
yy <- Re(fft(fft(x)^k, inverse = TRUE) / n)
# method 2 using repeated convolutions
y <- x
if (k >= 2) for(i in 2:k) y <- convolve(x, y, FALSE)
# check that the two methods give the same result
all.equal(y, yy)
## TRUE
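If a linear (non-wrapping) convolution is wanted instead, the same FFT idea should still apply after zero-padding x to the full output length, k * (length(x) - 1) + 1. A minimal sketch under that assumption; autoconv_fft is a hypothetical name:

```r
# Hypothetical helper: k-fold linear self-convolution via a zero-padded FFT
autoconv_fft <- function(x, k) {
  stopifnot(k >= 1)
  out_len <- k * (length(x) - 1) + 1         # length of the k-fold linear result
  xp <- c(x, rep(0, out_len - length(x)))    # pad so the circular result equals the linear one
  Re(fft(fft(xp)^k, inverse = TRUE)) / out_len
}

# Coefficients of (1 + 2t + 3t^2)^3, i.e. x convolved with itself 3 times
round(autoconv_fft(c(1, 2, 3), 3))
# 1 6 21 44 63 54 27
```

Results carry floating-point rounding error, so compare with all.equal rather than ==.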
autoconv <- function(x, n){
  if(n == 0){
    return(1)
  } else if(n == 1){
    return(x)
  } else {
    i <- 2
    xi <- conv(x,x)
    while(n %% i != 0){
      i <- i + 1
      xi <- conv(xi,x)
    }
    return(autoconv(xi,n/i))
  }
}
This will call conv() p - 1 times for each prime factor p of n (counted with multiplicity), rather than n times. For example, n = 12 = 2 * 2 * 3 needs 1 + 1 + 2 = 4 calls.
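When n is a large prime, the factor-based recursion still needs n - 1 calls. Repeated squaring is a possible alternative that keeps the number of calls on the order of log2(n) for any n. The sketch below assumes conv(u, v) is a linear convolution for which the scalar 1 acts as the identity. Both autoconv_sq and conv_direct are hypothetical names, and conv_direct is only a simple stand-in for the conv from the question.

```r
# Stand-in for conv(u, v): direct linear convolution (hypothetical)
conv_direct <- function(u, v) {
  out <- numeric(length(u) + length(v) - 1)
  for (i in seq_along(u)) {
    idx <- i:(i + length(v) - 1)
    out[idx] <- out[idx] + u[i] * v
  }
  out
}

# Hypothetical helper: n-fold self-convolution by repeated squaring
autoconv_sq <- function(x, n, conv = conv_direct) {
  result <- 1                                   # identity element for convolution
  base <- x
  while (n > 0) {
    if (n %% 2 == 1) result <- conv(result, base)  # use this power of x
    n <- n %/% 2
    if (n > 0) base <- conv(base, base)            # square for the next bit
  }
  result
}

autoconv_sq(c(1, 2, 3), 3)
# 1 6 21 44 63 54 27
```

Like the autoconv above, it returns 1 for n = 0 and x for n = 1.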
I'd like to perform this function on a matrix 100 times. How can I do this?
v = 1
m <- matrix(0,10,10)
rad <- function(x) {
idx <- sample(length(x), size=1)
flip = sample(0:1,1,rep=T)
if(flip == 1) {
x[idx] <- x[idx] + v
} else if(flip == 0) {
x[idx] <- x[idx] - v
return(x)
}
}
This is what I have so far but doesn't work.
for (i in 1:100) {
rad(m)
}
I also tried this, which seemed to work, but gave me an output of like 5226 rows for some reason. The output should just be a 10X10 matrix with changed values depending on the conditions of the function.
reps <- unlist(lapply(seq_len(100), function(x) rad(m)))
Ok I think I got it.
The return statement in your function is inside only one branch of the if statement, so the function returns the matrix about 50% of the time and does not return it in the other cases. You should change the function to this:
rad <- function(x) {
idx <- sample(length(x), size=1)
flip = sample(0:1,1,rep=T)
if(flip == 1) {
x[idx] <- x[idx] + v
} else if(flip == 0) {
x[idx] <- x[idx] - v
}
return(x)
}
Then you can do:
for (i in 1:n) {
m <- rad(m)
}
Note that this is semantically equal to:
for (i in 1:n) {
tmp <- rad(m) # returns a modified version of m (m is not changed yet)
# and puts it into tmp
m <- tmp # sets m equal to tmp, so the next iteration
# starts from the modified m
}
Running rad(m) does not change m.
Why?
The function works on a local copy of the matrix m, and that copy disappears when the function ends.
So you need to save what the function returns.
As @digEmAll wrote, the right code is:
for (i in 1:100) {
m <- rad(m)
}
You don't need a loop here. The whole operation can be vectorized.
v <- 1
m <- matrix(0,10,10)
n <- 100 # number of random replacements
idx <- sample(length(m), n, replace = TRUE) # indices
flip <- sample(c(-1, 1), n, replace = TRUE) # subtract or add
newVal <- aggregate(v * flip ~ idx, FUN = sum) # calculate new values for indices
m[newVal[[1]]] <- m[newVal[[1]]] + newVal[[2]] # add new values