Methods for iteratively changing matrices in a set in R - r

Currently I am working on an R project where I iteratively have to make a lot of small changes to the final output, which is stored in a self-made class. The calculation time becomes very large as the number of iterations increases. Unfortunately, a more vectorized version of the code is not possible, because the future values to be changed depend on the current changes.
A small example of the problem that I encounter is given below. The calculations using Example_Class take significantly longer than those where just a matrix is used. Are there methods available to speed up the calculations in R? Or should I look at an extension in, for example, C++?
Example_Class <- setClass(
  "Example",
  slots = c(
    slot_1 = "matrix",
    slot_2 = "matrix"
  ),
  prototype = list(
    slot_1 = matrix(1, ncol = 1, nrow = 7),
    slot_2 = matrix(1, ncol = 4, nrow = 7)
  )
)
Example <- Example_Class()
example_matrix_1 <- matrix(1, ncol = 1, nrow = 7)
example_matrix_2 <- matrix(1, ncol = 4, nrow = 7)
example_list <- list(example_matrix_1, example_matrix_2)
profile <- microbenchmark::microbenchmark(
  example_matrix_2[3,3] <- (example_matrix_2[3,3] + 1)/2,
  example_list[[2]][3,3] <- (example_list[[2]][3,3] + 1)/2,
  Example@slot_2[3,3] <- (Example@slot_2[3,3] + 1)/2,
  times = 1000
)
profile
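One possible workaround, offered as a sketch rather than a benchmarked answer: if the intermediate states do not have to be visible in the object, the slot can be copied to a local matrix once, updated in the loop, and written back once at the end. That way the slot assignment, which does extra work on every call (for instance checking the class of the assigned value), runs only once. The helper names update_local and update_slot and the starting value 0 are assumptions made for this illustration.

```r
library(methods)

Example_Class <- setClass(
  "Example",
  slots = c(slot_1 = "matrix", slot_2 = "matrix"),
  prototype = list(
    slot_1 = matrix(1, ncol = 1, nrow = 7),
    slot_2 = matrix(1, ncol = 4, nrow = 7)
  )
)

# Hypothetical helper: iterate on a local copy, assign the slot once
update_local <- function(obj, n_iter) {
  m <- obj@slot_2                      # read the slot once
  for (k in seq_len(n_iter)) {
    m[3, 3] <- (m[3, 3] + 1) / 2       # each step depends on the previous one
  }
  obj@slot_2 <- m                      # write the slot once
  obj
}

# Reference version: assign into the slot on every iteration
update_slot <- function(obj, n_iter) {
  for (k in seq_len(n_iter)) {
    obj@slot_2[3, 3] <- (obj@slot_2[3, 3] + 1) / 2
  }
  obj
}

start <- Example_Class()
start@slot_2[3, 3] <- 0                # assumed starting value so the update is visible

res_local <- update_local(start, 10)
res_slot  <- update_slot(start, 10)
all.equal(res_local@slot_2, res_slot@slot_2)
```

Both versions produce the same object; timing them with microbenchmark on the real problem would show whether the difference matters before moving to C++.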

Related

Process sets of rasters in parallel using lapp function from terra package

I have groups of rasters that I want to run a function on, I think probably using the lapp function from the {terra} package? Here is a simple example using toy data of the 'style' of thing I am hoping to accomplish.
library("terra")
rp10val = 106520
rp20val = 106520
rp50val = 154250
rp100val = 154250
rp200val = 154250
rp500val = 154250
rp1500val = 154250
sopval = 200
rp_10_vul = rast(nrow = 10, ncol = 10, vals = rep(rp10val, 10))
rp_20_vul = rast(nrow = 10, ncol = 10, vals = rep(rp20val, 10))
rp_50_vul = rast(nrow = 10, ncol = 10, vals = rep(rp50val, 10))
rp_100_vul = rast(nrow = 10, ncol = 10, vals = rep(rp100val, 10))
rp_200_vul = rast(nrow = 10, ncol = 10, vals = rep(rp200val, 10))
rp_500_vul = rast(nrow = 10, ncol = 10, vals = rep(rp500val, 10))
rp_1500_vul = rast(nrow = 10, ncol = 10, vals = rep(rp1500val, 10))
sop_tile = rast(nrow = 10, ncol = 10, vals = rep(sopval, 10))
input_raster_group <- c(rp_10_vul, rp_20_vul, rp_50_vul, rp_100_vul,
rp_200_vul, rp_500_vul, rp_1500_vul, sop_tile)
## In real world each of these lists would have rasters with different data in
input_raster_lists <- list(list(input_raster_group),
list(input_raster_group),
list(input_raster_group))
mcmapply(lapp,
input_raster_lists,
function(a,b,c,d,e,f,g,h){a+b+c+d+e+f+g+h},
mc.cores = 2)
## If working on windows, this might be better to try and run as proof of concept
# mapply(lapp,
# input_raster_lists,
# function(a,b,c,d,e,f,g,h){(a+b-c) / (d+e+f+g+h)})
Simplified data to make this easier to read
library("terra")
r10 = rast(nrow = 10, ncol = 10, vals = 10)
r20 = rast(nrow = 10, ncol = 10, vals = 20)
r50 = rast(nrow = 10, ncol = 10, vals = 50)
group <- c(r10, r20, r50)
input <- list(group, group, group)
You can use lapply to compute lists sequentially
x <- lapply(input, \(i) sum(i))
y <- lapply(input, \(i) app(i, sum))
z <- lapply(input, \(i) lapp(i, function(a,b,c){a+b+c}))
To use parallelization you could use e.g. parallel::parLapply or, as in your case, parallel::mcmapply.
SpatRaster objects hold a pointer (reference) to a C++ object that cannot be passed to a worker. Therefore you would need to use wrap and unwrap as I show below. I use proxy=TRUE to not force values to memory.
library(parallel)
inp <- lapply(input, \(x) wrap(x, proxy=TRUE))
f <- \(i) { unwrap(i) |> sum() |> wrap(proxy=TRUE)}
b <- mcmapply(f, inp)
out <- lapply(b, unwrap)
This approach may be useful in some cases, e.g. when you have to do many simulations on a relatively small raster that is in memory.
In most cases you would do parallelization because you are dealing with large rasters that are on disk. In that case, you could just send the filenames to the workers, and create the SpatRasters there (and write the output to disk).
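A minimal sketch of that file-based pattern, under stated assumptions: the toy files are written to tempdir() only so the example is self-contained, and the names in_files, out_files and sum_layers are hypothetical. Each worker receives only file names, creates the SpatRaster itself, and writes its result to disk.

```r
library(terra)
library(parallel)

# Toy input: three identical 3-layer rasters written to disk (assumed paths)
r10 <- rast(nrows = 10, ncols = 10, vals = 10)
r20 <- rast(nrows = 10, ncols = 10, vals = 20)
r50 <- rast(nrows = 10, ncols = 10, vals = 50)
group <- c(r10, r20, r50)

in_files  <- file.path(tempdir(), paste0("input_", 1:3, ".tif"))
out_files <- file.path(tempdir(), paste0("output_", 1:3, ".tif"))
for (f in in_files) writeRaster(group, f, overwrite = TRUE)

# Hypothetical worker function: only file names cross the process boundary
sum_layers <- function(infile, outfile) {
  r <- terra::rast(infile)                                   # create the SpatRaster on the worker
  terra::app(r, sum, filename = outfile, overwrite = TRUE)   # write the result to disk
  outfile                                                    # return the file name, not the raster
}

cl <- makeCluster(2)
res <- clusterMap(cl, sum_layers, in_files, out_files)
stopCluster(cl)

# Back in the main process, open the results from disk
out <- lapply(unlist(res, use.names = FALSE), rast)
```

A socket cluster is used here instead of mcmapply so that the sketch also runs on Windows; on other systems mcmapply with the same worker function should work as well.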
There is more discussion here

Finding index of array of matrices, that is closest to each element of another matrix in R

I have an array Q which has size nquantiles by nfeatures by nfeatures. In this, essentially the slice Q[1,,] would give me the first quantile of my data, across all nfeatures by nfeatures of my data.
What I am interested in is taking another matrix M (again of size nfeatures by nfeatures), which represents some other data, and asking which quantile in Q each element of M is closest to.
What would be the quickest way to do this?
I reckon I could do a double for loop across all rows and columns of the matrix M and come up with a solution similar to this: Finding the closest index to a value in R
But doing this over all nfeatures x nfeatures values will be very inefficient. I am hoping that there might exist a vectorized way of approaching this problem, but I am at a loss as to how to approach this.
Here is a reproducible example of the slow approach, which has O(N^2) complexity.
#Generate some data
set.seed(235)
data = rnorm(n = 100, mean = 0, sd = 1)
list_of_matrices = list(matrix(data = data[1:25], ncol = 5, nrow = 5),
matrix(data = data[26:50], ncol = 5, nrow = 5),
matrix(data = data[51:75], ncol = 5, nrow = 5),
matrix(data = data[76:100], ncol = 5, nrow = 5))
#Get the quantiles (5 quantiles here)
Q <- apply(simplify2array(list_of_matrices), 1:2, quantile, prob = c(seq(0,1,length = 5)))
#dim(Q)
#Q should have dims nquantiles by nfeatures by nfeatures
#Generate some other matrix M (true-data)
M = matrix(data = rnorm(n = 25, mean = 0, sd = 1), nrow = 5, ncol = 5)
#Loop through rows and columns in M to find which index of the array matches up closest with element M[i,j]
results = matrix(data = NA, nrow = 5, ncol = 5)
for (i in 1:nrow(M)) {
  for (j in 1:ncol(M)) {
    true_value = M[i,j]
    #Subset Q to the ith and jth element (vector of nquantiles)
    quantiles = Q[,i,j]
    results[i,j] = (which.min(abs(quantiles-true_value)))
  }
}
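One possible way to drop the explicit double loop, given as a hedged sketch: sweep() subtracts M[i,j] from every quantile in Q[,i,j] in a single call, and apply() over the last two dimensions then picks the closest index. apply() still iterates internally, so this is more compact rather than guaranteed faster; the names abs_diff, results_loop and results_vec are introduced here for illustration.

```r
# Same toy data as in the question
set.seed(235)
data <- rnorm(n = 100, mean = 0, sd = 1)
list_of_matrices <- list(matrix(data[1:25],   ncol = 5, nrow = 5),
                         matrix(data[26:50],  ncol = 5, nrow = 5),
                         matrix(data[51:75],  ncol = 5, nrow = 5),
                         matrix(data[76:100], ncol = 5, nrow = 5))
Q <- apply(simplify2array(list_of_matrices), 1:2, quantile,
           probs = seq(0, 1, length.out = 5))
M <- matrix(rnorm(n = 25, mean = 0, sd = 1), nrow = 5, ncol = 5)

# Reference: the double loop from the question
results_loop <- matrix(NA_integer_, nrow = nrow(M), ncol = ncol(M))
for (i in seq_len(nrow(M))) {
  for (j in seq_len(ncol(M))) {
    results_loop[i, j] <- which.min(abs(Q[, i, j] - M[i, j]))
  }
}

# Loop-free formulation: |Q[q, i, j] - M[i, j]| for all q, i, j at once
abs_diff <- abs(sweep(Q, 2:3, M))
# Index of the closest quantile for every (i, j)
results_vec <- apply(abs_diff, 2:3, which.min)

all(results_loop == results_vec)
```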

Applying a distance matrix to multiple data frames

I have 20 data frames of different lengths, but all the same number of columns. I would like to run some analyses, in this case a distance matrix using vegan, for each of these data frames. I have searched around and just figure I am missing a step somewhere.
The dummy data below uses 5 data frames, and I have been trying to use lapply.
df1<- matrix(data = c(1:100), nrow = 10, ncol = 10)
df2<- matrix(data = c(1:150), nrow = 15, ncol = 10)
df3<- matrix(data = c(1:50), nrow = 5, ncol = 10)
df4<- matrix(data = c(1:200), nrow = 20, ncol = 10)
df5<- matrix(data = c(1:100), nrow = 10, ncol = 10)
Y<- list(df1, df2, df3, df4, df5)
Y.dc <- lapply(Y, dist.ldc(Y, "chord"))
I have also tried just running it on the list directly, and I keep getting errors there too.
Y.dc<- dist.ldc(Y, "chord")
Ideally, I would like to not run 20 lines/chunks of code for each frame.
Eventually, I would also like to be able to generate nMDS plots, and run PERMANOVAs on each of the data frames all at once as well. Would I need to write/run a function in order to accomplish that?
A valid syntax:
Y.dc <- lapply(Y, dist.ldc, method = "chord")
(I assumed the function dist.ldc came from the package adespatial, which I don't know.)
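On the follow-up about writing a function: any function of one data frame or matrix can be passed to lapply in the same way. The sketch below uses a hand-rolled chord distance in base R (Euclidean distance between rows scaled to unit length) purely as a stand-in, so that it runs without adespatial; chord_dist is a hypothetical helper, not part of any package.

```r
df1 <- matrix(data = c(1:100), nrow = 10, ncol = 10)
df2 <- matrix(data = c(1:150), nrow = 15, ncol = 10)
df3 <- matrix(data = c(1:50),  nrow = 5,  ncol = 10)
df4 <- matrix(data = c(1:200), nrow = 20, ncol = 10)
df5 <- matrix(data = c(1:100), nrow = 10, ncol = 10)
Y <- list(df1, df2, df3, df4, df5)

# Hypothetical stand-in for dist.ldc(m, "chord"):
# scale every row to unit length, then take Euclidean distances
chord_dist <- function(m) {
  unit_rows <- m / sqrt(rowSums(m^2))
  dist(unit_rows)
}

# One distance object per data set, in the same order as Y
Y.dc <- lapply(Y, chord_dist)

# Number of rows behind each distance object
sizes <- sapply(Y.dc, attr, "Size")
```

Later steps such as an ordination or a permutation test could be wrapped in a function of the same shape and applied with another lapply call over Y or Y.dc.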

how can I set the bin centre values of histogram myself?

Let's say I have a data frame like the one below
mat <- data.frame(matrix(data = rexp(200, rate = 10), nrow = 100, ncol = 10))
I can then calculate the histogram of each of its columns using
matAllCols <- apply(mat, 2, hist)
Now if you look at matAllCols$breaks, you can see that the number of breaks is sometimes 11, sometimes 12, etc.
What I want is to fix this: for example, there should always be 12, and the distance between bin centres (stored as matAllCols$mids) should be 0.01.
Doing it for one column at a time seems simple, but when I tried to do it for all columns, it did not work. Also, this only sets the breaks; how to set the mids is not straightforward either.
matAllCols <- apply(mat, 2, function(x) hist(x , breaks = 12))
Is there any way to do this?
You can solve the problem by passing all the breakpoints between histogram cells as breaks. (This is documented at stat.ethz.ch/R-manual/R-devel/library/graphics/html/hist.html, as @Colonel Beauvel said.)
set.seed(1); mat <- data.frame(matrix(data = rexp(200, rate = 10), nrow = 100, ncol = 10))
# You need to check the data range to decide the breakpoints.
range(mat) # [1] 0.002025041 0.483281274
# You can set the breakpoints manually.
matAllCols <- apply(mat, 2, function(x) hist(x , breaks = seq(0, 0.52, 0.04)))
You are looking for
set.seed(1)
mat <- data.frame(matrix(data = rexp(200, rate = 10), nrow = 100, ncol = 10))
matAllCols <- apply(mat, 2, function(x) hist(x , breaks = seq(0, 0.5, 0.05)))
or simply
x <- rexp(200, rate = 10)
hist(x[x>=0 & x <=0.5] , breaks = seq(0, 0.5, 0.05))
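To connect this back to the mids in the question: hist() reports mids as the midpoints of consecutive breaks, so fixing the breaks for every column also fixes the bin centres. The sketch below is an illustration under assumptions: a spacing of 0.01 with exactly 12 bins would only cover 0 to 0.12, which is narrower than the data range reported above, so an assumed width of 0.05 is used and the number of bins is derived from the data. The names width, n_bins, hists and mids are introduced here.

```r
set.seed(1)
mat <- data.frame(matrix(data = rexp(200, rate = 10), nrow = 100, ncol = 10))

width  <- 0.05                           # assumed bin width, choose your own
n_bins <- ceiling(max(mat) / width)      # enough bins to cover the largest value
breaks <- (0:n_bins) * width             # identical breakpoints for every column

# plot = FALSE returns the histogram objects without drawing them
hists <- lapply(mat, function(x) hist(x, breaks = breaks, plot = FALSE))

# One column of bin centres per data column; all columns are identical
mids <- sapply(hists, function(h) h$mids)
diff(mids[, 1])                          # spacing between neighbouring bin centres
```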

SAVE groups clustering to out of r

I wrote the following code to cluster data:
clusrer.data <- function(data,n) {
miRNA.exp.cluster <- scale(t(miRNA.exp))
k.means.fit <- kmeans(miRNA.exp.cluster,n)
#i try to save the results of k-means cluster by this code :
k.means.fit <- as.data.frame(k.means.fit)
write.csv(k.means.fit, file="k-meanReslut.csv")
#x<-k.means.fit$clusters
#write.csv(x, file="k-meanReslut.csv")
}
But I cannot save the clusters (8, 6, 7, 20, 18) outside of R. I want to save each cluster separately (with its columns and rows) in a txt or CSV file.
Here is one approach of splitting the original dataset according to cluster and saving that chunk to a file. I added cluster assignment to the original dataset for easier visual check. Example is partly taken from ?kmeans. Feel free to adapt the way files are written, as well as the way file name is created.
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 2))
x <- cbind(x, cluster = cl$cluster)
by(x, INDICES = cl$cluster, FUN = function(sp) {
write.table(sp, file = paste0("file", unique(sp$cluster), ".txt"),
row.names = TRUE, col.names = TRUE)
})
