I'm new to R programming and I know I could write a loop to do this, but everything I read says that for simplicity it's best to avoid loops and use apply instead.
I have a matrix and I would like to run this function on each element in the matrix.
cellresidue <- function(i,j){
result <- (cluster[i,j] - cluster.I[i,] - cluster.J[j,] - cluster.IJ)/(cluster.N*cluster.M)
return (result)
}
i= element row
j= element column
cluster.J is a matrix of column means
cluster.I is a matrix of row means
cluster.IJ is the mean of the entire matrix named cluster
What I can't figure out is how to get the row and column of the element that mapply is working on (I think I should use the row() and col() functions), and how to pass those as arguments to mapply or apply.
There is no need for loops or *apply functions. You can just use plain matrix operations:
nI <- nrow(cluster)
nJ <- ncol(cluster)
# every column holds the row means
cluster.I <- matrix(rowMeans(cluster), nI, nJ, byrow = FALSE)
# every row holds the column means
cluster.J <- matrix(colMeans(cluster), nI, nJ, byrow = TRUE)
cluster.IJ <- matrix( mean(cluster), nI, nJ)
residue.mat <- (cluster - cluster.I - cluster.J - cluster.IJ) /
(cluster.N * cluster.M)
(You did not explain what cluster.N and cluster.M are but I assume they are scalars)
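To connect this back to the question about row() and col(): the sketch below checks, on made-up data, that a per-cell mapply call and the vectorised expression give the same matrix. The data, the choice of cluster.N and cluster.M as the matrix dimensions, and the names row_means, col_means and grand_mean are all assumptions for illustration; the formula is the one from the question.

```r
set.seed(1)
cluster <- matrix(rnorm(12), nrow = 3, ncol = 4)  # mock data (assumption)
cluster.N <- nrow(cluster)                        # assumed to be scalars
cluster.M <- ncol(cluster)

row_means  <- rowMeans(cluster)
col_means  <- colMeans(cluster)
grand_mean <- mean(cluster)

# Per-cell version of the question's formula
cellresidue <- function(i, j) {
  (cluster[i, j] - row_means[i] - col_means[j] - grand_mean) /
    (cluster.N * cluster.M)
}

# row() and col() return matrices holding the row and column index of each cell,
# so mapply walks over all (i, j) pairs in column-major order
by_cell <- matrix(mapply(cellresidue, row(cluster), col(cluster)),
                  nrow = nrow(cluster))

# Vectorised version: row means recycle down the columns,
# column means are repeated once per row
vectorised <- (cluster - row_means -
               rep(col_means, each = nrow(cluster)) - grand_mean) /
  (cluster.N * cluster.M)

all.equal(by_cell, vectorised)
```

Both approaches agree, but the vectorised one avoids one function call per cell.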
It is not clear from your question what you are trying to do. It is best on this site to provide some mock data (preferably generated by the code, not pasted), and then show what form the end result should look like. It seems that the apply family is not what you seek.
Quick disambiguation between apply, sapply and mapply:
#providing data for examples
X=matrix(rnorm(9),3,3)
apply: apply a function to either columns (2) or rows (1) of a matrix or array
#here, sum by columns, same as colSums(X)
apply(X, 2, sum)
sapply: apply a function against (usually) a list of objects
#create a list with three vectors
mylist=list(1:4, 5:10, c(1,1,1))
#get the mean of each vector
sapply(mylist, mean)
#subtract 2 from each element of X, same as c(X-2)
sapply(X, FUN=function(x) x-2)
mapply: a multivariate version of sapply, taking an arbitrary number of arguments. Never had much use of it… Some rock-bottom examples:
#same as c(1,2,3,4) + c(15,16,17,18)
mapply(sum, 1:4, 15:18)
#same as c(X+X), the vectorized matrix sum
mapply(sum, X, X)
Side note: It's perfectly OK to use loops in R; use whichever best suits your way of thinking. The issue is that with a "really big" number of iterations you could hit bottlenecks, depending on your patience. There are two solutions to this: rewrite your function in C/FORTRAN (and boost speed), or use built-in functions if applicable (which are, by the way, often written in C or FORTRAN).
How can we construct a block-diagonal matrix from a three-dimensional array in R? There are several possibilities when starting from a list of matrices (e.g., Reduce(magic::adiag, list_of_matrices)) or individual matrices (e.g., magic::adiag(matrix1, matrix2)). However, I could not find anything when we start with an array:
matrices <- array(NA, c(3,3,2))
matrices[,,1] <- diag(1,3)
matrices[,,2] <- matrix(rnorm(9), 3, 3)
Are there any efficient solutions for constructing the corresponding 9x9 block matrix or is it a better idea to just convert to a list and use magic::adiag? The latter seems relatively inefficient, especially when the number of matrices is large.
I guess converting to a list and using magic::adiag is the fastest way. Try the following code, which is rather short and which I use frequently:
library(magic)
arr <- array(1:8, c(2,2,3))
do.call("adiag", lapply(seq(dim(arr)[3]), function(x) arr[ , , x]))
This essentially reduces to a one-liner but uses lists.
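If avoiding the list conversion matters, one base-R alternative is to preallocate the result and copy each slice into its diagonal block. This is only a sketch: block_diag_from_array is a hypothetical helper name, not a function from magic or base R, and whether it is faster than adiag for your sizes would need to be measured.

```r
# Hypothetical helper: build a block-diagonal matrix from the slices arr[, , k]
block_diag_from_array <- function(arr) {
  d <- dim(arr)
  out <- matrix(0, d[1] * d[3], d[2] * d[3])  # preallocate with zeros
  for (k in seq_len(d[3])) {
    rows <- (k - 1) * d[1] + seq_len(d[1])    # row range of block k
    cols <- (k - 1) * d[2] + seq_len(d[2])    # column range of block k
    out[rows, cols] <- arr[, , k]
  }
  out
}

arr <- array(1:12, c(2, 2, 3))
B <- block_diag_from_array(arr)
dim(B)
```

The loop runs once per slice rather than once per element, so it stays cheap even for many matrices.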
I am trying to analyse a dataframe using hierarchical clustering hclust function in R.
I would like to pass in a vector of p values I'll write beforehand (maybe something like c(5/4, 3/2, 7/4, 9/4)) and be able to have these specified as the different p value options with Minkowski distance when I use expand.grid. Ideally, when hyperparams is viewed, it would also be clear which value of p has been used for each minkowski, i.e. they should be labelled. So for example, where (if you run my code for hyperparams) there would currently just be one minkowski under Dists, for each of the methods in Meths, there would be, if I supplied the p vector as c(5/4, 3/2, 7/4, 9/4), now instead 4 rows for Minkowski distance: minkowski, p=5/4, minkowski, p=3/2, minkowski, p=7/4, minkowski, p=9/4 (or looking something like that, making the p values clear). Any ideas?
(Note: no packages please, only base R!)
Edit: I worded it poorly before, now rewritten. Let's take the following example instead:
acc <- function(x){
first = sum(x)
second = sum(x^2)
return(list(First=first,Second=second))
}
iris0 <- iris
iris1 <- cbind(log(iris[,1:4]),iris[5])
iris2 <- cbind(sqrt(iris[,1:4]),iris[5])
Now the important bit:
tests <- expand.grid(Dists=c("euclidean","maximum","manhattan","canberra","binary"),
DS=c("iris0","iris1","iris2"))
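# expand.grid returns factors, so convert to character before using get()
Table <- Map(function(x, ds){acc(table(get(ds)$Species, cutree(hclust(dist(get(ds)[,1:4], method=x)),3)))},as.character(tests[[1]]), as.character(tests[[2]]))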
This will work. But now if I want to include a term like "minkowski",p=3 in expand.grid, how would I do it?
tests <- expand.grid(Dists=c("euclidean","maximum","manhattan","canberra","binary","minkowski,p=3"),
DS=c("iris0","iris1","iris2"))
Table <- Map(function(x, ds){acc(table(get(ds)$Species, cutree(hclust(dist(get(ds)[,1:4], method=x)),3)))},as.character(tests[[1]]), as.character(tests[[2]]))
This gives an error.
In reality there should be no p argument unless method="minkowski". I have tried to use strsplit to get the first part of the expression into ds, and a switch with strsplit to get the second part and then use parse (it would return NULL if the length of the strsplit result was not 2, which should pass no argument, I think). The issue seems to be that strsplit(x,",") fails to evaluate the vectorized x and instead tries to evaluate the character x, which is not a string. Can anyone suggest a workaround/fix or another method for including terms like minkowski,p=1.6?
We can encode the 'p' value in the method name and later split it out into its own column:
tests <- expand.grid(Dists=c("euclidean","maximum","manhattan","canberra","binary",
"minkowski3", "minkowski4", "minkowski5"),
DS=c("iris0","iris1","iris2"))
Then add a 'p' column to 'tests' (filled with the default p of dist), set it for the minkowski rows, strip the numeric suffix from the method names, and pass p through Map:
tests$p <- as.list(args(dist))$p # default value
i1 <- grepl("minkowski", tests$Dists)
tests$Dists <- sub("[0-9.]+$", "", tests$Dists)
tests$p[i1] <- rep(3:5, length.out = sum(i1))
Map(function(x, ds, p){
dist1 <- dist(get(ds)[, 1:4], method = x, p = p)
ct <- cutree(hclust(dist1), 3)
acc(table(get(ds)$Species, ct))},
as.character(tests[[1]]), as.character(tests[[2]]), tests$p )
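A variation on the same idea, as a sketch only: instead of assigning 3:5 by position, parse p out of each label, which also handles fractional values such as 1.5. The Label column and the reduced set of methods and data sets are assumptions made to keep the example short.

```r
iris0 <- iris
iris1 <- cbind(log(iris[, 1:4]), iris[5])

tests <- expand.grid(Dists = c("euclidean", "minkowski3", "minkowski1.5"),
                     DS = c("iris0", "iris1"),
                     stringsAsFactors = FALSE)

tests$Label <- tests$Dists                  # keep a readable label per row
is_mink <- grepl("^minkowski", tests$Dists)
tests$p <- 2                                # default p of dist()
tests$p[is_mink] <- as.numeric(sub("^minkowski", "", tests$Dists[is_mink]))
tests$Dists <- sub("[0-9.]+$", "", tests$Dists)  # strip the numeric suffix

# dist() ignores p unless method = "minkowski"
res <- Map(function(method, ds, p) {
  cutree(hclust(dist(get(ds)[, 1:4], method = method, p = p)), 3)
}, tests$Dists, tests$DS, tests$p)

tests
```

Printing tests then shows which p belongs to which minkowski row.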
I have this parameter:
L_inf <- seq(17,20,by=0.1)
and this function:
fun <- function(x){
L_inf*(1-exp(-B*(x-0)))}
I would like to apply this function for a range of values of L_inf.
I tried with a for loop, like this:
A <- matrix() # maybe 10 col and 31 row or vice versa
for (i in L_inf){
A[i] <- fun(1:10)
}
But R responds: longer object length is not a multiple of shorter object length.
My expected output is a matrix (or data frame, or maybe a list) with 10 results (fun(1:10)) for each value of the vector L_inf (length = 31).
How can I do it?
You are trying to put a vector of 10 elements into a single matrix cell. You want to assign it to a matrix row instead (you can access the ith row with A[i,]).
But a for loop is unnecessary in this case, and it is quite straightforward to use one of the "apply" functions. lapply, the most basic of them, returns a list (which is the most versatile container since there is basically no constraint on its elements).
Here sapply is an apply function which tries to simplify its result to a convenient data structure. In this case, since all results have the same length (10), sapply will simplify the result to a matrix with 10 rows and one column per value of L_inf.
Note that I modified your function to make it explicitly depend on L_inf. As written, your fun multiplies the whole 31-element vector L_inf by a 10-element vector, which is what triggers the message about object lengths (see keyword "closures" if you want more info).
L_inf_range <- seq(17,20,by=0.1)
B <- 1
fun <- function(x, L_inf) {
L_inf*(1-exp(-B*(x-0)))
}
sapply(L_inf_range, function(L) fun(1:10, L_inf=L))
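As a cross-check, the same table can be built with outer(), which evaluates a vectorised function over every combination of x and L_inf. This is a sketch assuming B = 1 as above; by_sapply and by_outer are names introduced only for this illustration.

```r
L_inf_range <- seq(17, 20, by = 0.1)
B <- 1
fun <- function(x, L_inf) {
  L_inf * (1 - exp(-B * (x - 0)))
}

# One column per value of L_inf, one row per value of x
by_sapply <- sapply(L_inf_range, function(L) fun(1:10, L_inf = L))

# outer() passes all (x, L_inf) pairs to fun in one vectorised call
by_outer <- outer(1:10, L_inf_range, fun)

dim(by_sapply)
all.equal(by_sapply, by_outer)
```

Use t(by_sapply) if you prefer one row per value of L_inf.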
I am normally a Maple user currently working with R, and I have a problem with correctly indexing variables.
Say I want to define 2 vectors, v1 and v2, and I want to call the nth element in v1. In Maple this is easily done:
v[1]:=some vector,
and the nth element is then called by the command
v[1][n].
How can this be done in R? The actual problem is as follows:
I have a sequence M (say of length 10, indexed by k) of simulated negbin variables. For each of these simulated variables I want to construct a vector X of length M[k] with entries given by some formula. So I should end up with 10 different vectors, each of different length. My incorrect code looks like this
sims<-10
M<-rnegbin(sims, eks_2016_kasko*exp(-2.17173), 840.1746)
for(k in 1:sims){
x[k]<-rep(NA,M[k])
X[k]<-rep(NA,M[k])
for(i in 1:M[k]){x[k][i]<-runif(1,min=0,max=1)
if(x[k][i]>=0 & x[i]<=0.1056379){
X[k][i]<-rlnorm(1, 6.228244, 0.3565041)}
else{
X[k][i]<-rlnorm(1, 8.910837, 1.1890874)
}
}
}
The error appears to be that x[k] is not a valid name for a variable. Any way to make this work?
Thanks a lot :)
I've edited your R script slightly to get it working and make it reproducible. To do this I had to assume that eks_2016_kasko was an integer value of 10.
require(MASS)
sims<-10
# R is not zero-indexed, so add one to make every count at least 1
M<-rnegbin(sims, 10*exp(-2.17173), 840.1746) + 1
# Create a list
x <- list()
X <- list()
for (k in 1:sims) {
  x[[k]] <- rep(NA, M[k])
  X[[k]] <- rep(NA, M[k])
  for (i in 1:M[k]) {
    x[[k]][i] <- runif(1, min = 0, max = 1)
    if (x[[k]][i] >= 0 & x[[k]][i] <= 0.1056379) {
      X[[k]][i] <- rlnorm(1, 6.228244, 0.3565041)
    } else {
      X[[k]][i] <- rlnorm(1, 8.910837, 1.1890874)
    }
  }
}
This will work, and I think it is what you were trying to do, BUT it is not great R code. I strongly recommend using the apply family (e.g. lapply) instead of for loops, and learning to use data.table and parallelisation if you need things to scale. Additionally, if you want to read more about indexing and subsetting in R, Hadley Wickham has a comprehensive breakdown here.
Hope this helps!
Let me start with a few remarks and then show you how your problem can be solved using R.
In R, there is usually no need to use a for loop to assign several values to a vector. For example, to fill a vector of length 100 with uniformly distributed random values, you could write a loop like this:
set.seed(1234)
x1 <- rep(NA, 100)
for (i in 1:100) {
x1[i] <- runif(1, 0, 1)
}
(set.seed() is used to set the random seed, such that you get the same result each time.) It is much simpler (and also much faster) to do this instead, resetting the seed first so that both versions draw the same numbers:
set.seed(1234)
x2 <- runif(100, 0, 1)
identical(x1, x2)
## [1] TRUE
As you see, results are identical.
The reason that x[k]<-rep(NA,M[k]) does not work is that indeed x[k] is not a valid variable name in R. [ is used for indexing, so x[k] refers to element k of a vector x. A single element of an atomic vector cannot hold a vector of length larger than 1, so the assignment cannot do what you want. What you probably want to use is a list, as you will see in the example below.
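As a small illustration of the Maple-style v[1][n] pattern, here is a sketch using a list; the values are made up. [[ extracts an element of the list, and a following [ indexes into that element.

```r
v <- list()
v[[1]] <- c(10, 20, 30)   # first vector
v[[2]] <- c(1, 2)         # second vector, of a different length

v[[1]][2]      # second entry of the first vector
class(v[1])    # single brackets return a sub-list, not the vector itself
```

This is why the list elements can have different lengths, which a matrix or plain vector would not allow.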
So here comes the code that I would use instead of what you proposed in your post. Note that I am not sure that I correctly understood what you intend to do, so I will also describe below what the code does. Let me know if this fits your intentions.
# define M
library(MASS)
eks_2016_kasko <- 486689.1
sims<-10
M<-rnegbin(sims, eks_2016_kasko*exp(-2.17173), 840.1746)
# define the function that calculates X for a single value from M
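calculate_X <- function(m) {
x <- runif(m, min=0,max=1)
# x <= 0.1056379 selects the first log-normal, as in the if/else of the question
X <- ifelse(x <= 0.1056379, rlnorm(m, 6.228244, 0.3565041),
rlnorm(m, 8.910837, 1.1890874))
X # return the vector explicitly
}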
# apply that function to each element of M
X <- lapply(M, calculate_X)
As you can see, there are no loops in that solution. I'll start to explain at the end:
lapply is used to apply a function (calculate_X) to each element of a list or vector (here it is the vector M). It returns a list. So, you can get, e.g. the third of the vectors with X[[3]] (note that [[ is used to extract elements from a list). And the contents of X[[3]] will be the result of calculate_X(M[3]).
The function calculate_X() does the following: It creates a vector of m uniformly distributed random values (remember that m runs over the elements of M) and stores that in x. Then it creates a vector X that contains log-normally distributed random variables. The parameters of the distribution for each entry depend on the corresponding value in x.
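A possible refinement, offered only as a sketch: rlnorm() accepts vectors for meanlog and sdlog, so the parameters can be chosen per entry and each value is drawn only once, rather than drawing from both distributions and discarding half. The name calculate_X_params and the small M used here are assumptions for illustration.

```r
# Hypothetical variant: pick the parameters per entry, then draw once
calculate_X_params <- function(m) {
  x <- runif(m, min = 0, max = 1)
  small <- x <= 0.1056379                       # same threshold as in the question
  rlnorm(m,
         meanlog = ifelse(small, 6.228244, 8.910837),
         sdlog   = ifelse(small, 0.3565041, 1.1890874))
}

set.seed(1234)
M <- c(3, 1, 5)                                 # mock counts (assumption)
X <- lapply(M, calculate_X_params)
lengths(X)
```

The result has the same structure as before: a list with one vector of length M[k] per element of M.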
I'm trying to formulate the code below using the lapply function (or actually the mclapply function) instead of the apply function. I want it to return a matrix or similar and not a list. The hi function is very complicated in my actual code, so I just presented a very basic example.
hi <- function(a, matrix) {
hi <- a[1] / a[2] * t(matrix) %*% matrix
return(hi)
}
a_1 <- t(matrix(1:4))
a_2 <- t(matrix(5:8))
choleski <- matrix(1:4)
result <- apply(rbind(a_1, a_2), 2, hi, matrix=choleski)
So my question is basically, how do I reformulate the code above using lapply instead of apply, i.e. apply the lapply function to the hi function instead of using the apply procedure. An efficient solution would be awesome.
Thanks.
If you aim to apply hi to each column of rbind(a_1, a_2), then you could do something like this:
A <- rbind(a_1, a_2)
sapply(seq_len(ncol(A)), function(i) hi(A[, i], choleski))
or, more simply (if it's ok to make rbind(a_1, a_2) a data.frame):
A <- as.data.frame(rbind(a_1, a_2))
sapply(A, function(x) hi(x, choleski))
The latter works since a data.frame is a type of list (where columns are its elements), and sapply applies a function to list elements. By default, sapply simplifies its output if possible.
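Since the question asks specifically about lapply and mclapply, here is a sketch of that form. Both return a list, so the result is flattened with unlist(), which works here because each call to hi returns a 1x1 matrix. mc.cores = 1L is an assumption chosen so the example also runs on Windows; raise it on Unix-like systems to actually run in parallel.

```r
hi <- function(a, matrix) {
  a[1] / a[2] * t(matrix) %*% matrix
}
A <- rbind(t(matrix(1:4)), t(matrix(5:8)))
choleski <- matrix(1:4)

# Reference result from the original apply() call
res_apply <- apply(A, 2, hi, matrix = choleski)

# lapply over column indices, then flatten the list of 1x1 matrices
res_list <- lapply(seq_len(ncol(A)), function(i) hi(A[, i], choleski))
res <- unlist(res_list)

# mclapply has the same interface as lapply
res_par <- unlist(parallel::mclapply(seq_len(ncol(A)),
                                     function(i) hi(A[, i], choleski),
                                     mc.cores = 1L))

all.equal(res, res_apply)
```

If hi returned longer vectors of equal length, do.call(cbind, res_list) would give a matrix with one column per input column, matching what apply produces.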