Randomization macro - r

I am trying to measure the phylogenetic signal in two variables, one discrete and one continuous. To do so, I use the δ-statistic (Borges et al. 2018) and the K-statistic (Blomberg et al. 2003), respectively. I have a tree and two vectors corresponding to my variables. The calls for these two statistics are the following:
1) delta(vector, tree, lambda0, se, sim, thin, burn)
2) phylosig(tree, vector, method = "K")
I get a single value each time, but I would like to randomize my vectors to test the significance of the original values. I would like to do 1000 repetitions and then run a simple significance test but, as I am a new R user, I have no idea how to do it. I am thinking of something like this:
For δ:
# first repetition
random_vector <- sample(vector)
random_delta <- delta(random_vector, tree, lambda0, se, sim, thin, burn)
write.xlsx(random_delta, path)
# second repetition
random_vector <- sample(random_vector)
random_delta <- delta(random_vector, tree, lambda0, se, sim, thin, burn)
write.xlsx(random_delta, path, append = TRUE)
And so on, until 1000 δ-statistics are stored in a single .xlsx, ready to be used in a test.
For K, I guess it is a bit different, since the input is not a bare vector anymore but in effect a table with two columns (species, values), i.e. a vector named by species:
# first repetition
random_vector <- sample(vector)
names(random_vector) <- tree$tip.label
random_K <- phylosig(tree, random_vector, method = "K")
write.xlsx(random_K, path)
# second repetition
random_vector <- sample(random_vector)
names(random_vector) <- tree$tip.label
random_K <- phylosig(tree, random_vector, method = "K")
write.xlsx(random_K, path, append = TRUE)
Etc.
I thought of that, but maybe someone has another idea; either way, I am open to suggestions.
I hope I have made myself clear.
EDIT:
Thank you for your answers. Yes, a loop is exactly what I need. And yes, writing all the values to one vector seems more appropriate; I agree with you.
With the δ-statistic, the higher δ is, the stronger the phylogenetic signal in the data. But what counts as high? That is why I want to do 1000 iterations: to calculate a p-value and show whether the original value is 'exceptional' or not. The same goes for K, which lies between 0 and 1 in the presence of a phylogenetic signal.
Here is a more explicit example:
> library(phytools)
> trait_delta <- c(2,1,3,1,1,3,1,3,2,1,1,2,2,2,2,1,1,3,1,1)
> trait_K <- c(2,1,3,1,1,3,1,3,2,1,1,2,2,2,2,1,1,3,1,1)
> set.seed(25)
> ns <- 20
> tree <- rtree(ns)
> plot(tree)
>
> #delta
> lambda0 <- 0.1
> se <- 0.5
> sim <- 10000
> thin <- 10
> burn <- 100
>
> obs_delta <- delta(trait_delta, tree, lambda0, se, sim, thin, burn)  # renamed so the delta() function is not masked
> rand_values_delta <- c(obs_delta)
>
> #to loop 999 times
> rand_trait_delta <- sample(trait_delta)
> rand_delta <- delta(rand_trait_delta, tree, lambda0, se, sim, thin, burn)
> rand_values_delta <- append(rand_values_delta, rand_delta)
>
>
> #K
> names(trait_K) <- tree$tip.label
> K <- phylosig(tree, trait_K, method = "K")
> rand_values_K <- c(K)
>
> #to loop 999 times
> rand_trait_K <- sample(trait_K)
> names(rand_trait_K) <- tree$tip.label
> rand_K <- phylosig(tree, rand_trait_K, method = "K")
> rand_values_K <- append(rand_values_K, rand_K)

I'm still a bit hazy on exactly what you are doing, but hopefully this points you in the right direction:
n <- 10000  # run this many iterations
results <- rep(NA, n)  # create an empty vector to store all the values
for (i in 1:n) {
  # get the random vector for this iteration
  random_vector <- sample(vector)
  # save the value you output
  results[i] <- delta(random_vector, tree, lambda0, se, sim, thin, burn)
}
# write the single vector to file
write.xlsx(results, path)
This is untested, as I don't know what vector or path are; I just used them from your code.
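For the significance test itself, here is a minimal sketch of the full permutation procedure for both statistics, assuming the trait_delta, trait_K, tree, and delta() from your edited example (and that phylosig() returns the K value directly, as it does in your example):

library(phytools)  # for phylosig(); delta() is sourced from the Borges et al. (2018) script

n_perm <- 1000

# observed statistics
obs_delta <- delta(trait_delta, tree, lambda0, se, sim, thin, burn)
names(trait_K) <- tree$tip.label
obs_K <- phylosig(tree, trait_K, method = "K")

rand_delta <- numeric(n_perm)  # pre-allocate the result vectors
rand_K <- numeric(n_perm)
for (i in 1:n_perm) {
  shuffled <- sample(trait_delta)  # permute trait values across the tips
  rand_delta[i] <- delta(shuffled, tree, lambda0, se, sim, thin, burn)

  shuffled_K <- sample(trait_K)
  names(shuffled_K) <- tree$tip.label  # re-attach tip labels after shuffling
  rand_K[i] <- phylosig(tree, shuffled_K, method = "K")
}

# one-sided permutation p-values: the fraction of randomized statistics
# at least as large as the observed ones
p_delta <- mean(rand_delta >= obs_delta)
p_K <- mean(rand_K >= obs_K)

Be warned that the δ half will be slow, since delta() runs an MCMC on every iteration. Note also that phylosig() has a built-in randomization test (test = TRUE, nsim = 1000), which could replace the K half of this loop.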

Related

input k-means in R

I'm trying to perform k-means on a data frame with 69 columns and 1000 rows. First, I need to decide on the optimal number of clusters using the Davies-Bouldin index. This algorithm requires the input to be in the form of a matrix, so I used this code first:
totalm <- data.matrix(total)
Followed by this code for the Davies-Bouldin index:
clusternumber <- 0
max_cluster_number <- 30
# Davies-Bouldin algorithm
library(clusterCrit)
smallest <- 99999
for (b in 2:max_cluster_number) {
  a <- 99999
  for (i in 1:200) {
    cl <- kmeans(totalm, b)
    cl <- as.numeric(cl)
    intCriteria(totalm, cl$cluster, c("dav"))
    if (intCriteria(totalm, cl$cluster, c("dav"))$davies_bouldin < a) {
      a <- intCriteria(totalm, cl$cluster, c("dav"))$davies_bouldin
    }
  }
  if (a < smallest) {
    smallest <- a
    clusternumber <- b
  }
}
print("##clusternumber##")
print(clusternumber)
print("##smallest##")
print(smallest)
I keep getting this error: (list) object cannot be coerced to type 'double'.
How can I solve this?
Reproducible example:
a <- c(0,0,1,0,1,0,0)
b <- c(0,0,1,0,0,0,0)
c <- c(1,1,0,0,0,0,1)
d <- c(1,1,0,0,0,0,0)
total <- cbind(a,b,c,d)
The error is coming from cl <- as.numeric(cl): the result of a call to kmeans is an object of class "kmeans", which is a list holding various pieces of information about the model, and a list cannot be coerced to a single numeric vector. Run ?kmeans to see what it contains.
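For reference, here is a quick look at what that list contains (a toy example; output shown as comments):

cl <- kmeans(matrix(rnorm(40), ncol = 4), centers = 2)
class(cl)    # "kmeans"
names(cl)    # "cluster" "centers" "totss" "withinss" "tot.withinss"
             # "betweenss" "size" "iter" "ifault"
cl$cluster   # the integer vector of cluster assignments that intCriteria() needs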
I would also recommend adding nstart = 20 to your kmeans call. k-means clustering is a random process, so this runs the algorithm 20 times for each number of centers and keeps the best fit.
for (b in 2:max_cluster_number) {
  a <- 99999
  for (i in 1:200) {
    cl <- kmeans(totalm, centers = b, nstart = 20)
    # cl <- as.numeric(cl)  # removed: this line caused the error
    dbi <- intCriteria(totalm, cl$cluster, c("dav"))$davies_bouldin
    if (dbi < a) {
      a <- dbi
    }
  }
  if (a < smallest) {
    smallest <- a
    clusternumber <- b
  }
}
This gave me
[1] "##clusternumber##"
[1] 4
[1] "##smallest##"
[1] 0.138675
(temporarily changing max_cluster_number to 4, as the reproducible data set is small)
EDIT: Integer Error
I was able to reproduce your error using
a <- as.integer(c(0,0,1,0,1,0,0))
b <- as.integer(c(0,0,1,0,0,0,0))
c <- as.integer(c(1,1,0,0,0,0,1))
d <- as.integer(c(1,1,0,0,0,0,0))
totalm <- cbind(a,b,c,d)
So that an integer matrix is created.
I was then able to remove the error by using
storage.mode(totalm) <- "double"
Note that
total <- cbind(a,b,c,d)
totalm <- data.matrix(total)
is unnecessary for the data in this example:
> identical(total,totalm)
[1] TRUE

Efficient algorithm to turn matrix subdiagonal to columns r

I have a non-square matrix and need to do some calculations on its subdiagonals. I figured that the best way is to turn the subdiagonals into columns/rows and use functions like cumprod. Right now I use a for loop and exdiag, defined as below:
exdiag <- function(mat, off=0) {mat[row(mat) == col(mat)+off]}
However, it does not seem to be very efficient. Do you know any other algorithm to achieve that kind of result?
A little example to show what I am doing:
exdiag <- function(mat, off = 0) {mat[row(mat) == col(mat) + off]}
mat <- matrix(1:72, nrow = 12, ncol = 6)
newmat <- matrix(nrow = 11, ncol = 6)
for (i in 1:11) {
  newmat[i, ] <- c(cumprod(exdiag(mat, i)), rep(0, max(6 - 12 + i, 0)))
}
Best regards,
Artur
The fastest, but by far the most cryptic, solution to get all possible diagonals from a non-square matrix would be to treat your matrix as a vector and simply construct an id vector for selection. At the end you can transform it back to a matrix if you want.
The following function does that:
exdiag <- function(mat) {
  NR <- nrow(mat)
  NC <- ncol(mat)
  smalldim <- min(NC, NR)

  if (NC > NR) {
    id <- seq_len(NR) +
      seq.int(0, NR - 1) * NR +
      rep(seq.int(1, NC - 1), each = NR) * NR
  } else if (NC < NR) {
    id <- seq_len(NC) +
      seq.int(0, NC - 1) * NR +
      rep(seq.int(1, NR - 1), each = NC)
  } else {
    return(diag(mat))
  }

  out <- matrix(mat[id], nrow = smalldim)
  id <- (ncol(out) + 1 - row(out)) - col(out) < 0
  out[id] <- NA
  return(out)
}
Keep in mind that you have to take into account how your matrix is shaped: the two branches handle matrices that are wider than tall and taller than wide.
In both cases I follow the same logic:
first construct a sequence indicating positions along the smallest dimension
To this sequence, add 0, 1, 2, ... times the row length.
This creates the first diagonal. After doing this, you simply add a sequence that shifts the entire previous sequence by 1 (either down or to the right) until you reach the end of the matrix. To shift right, I need to multiply this sequence by the number of rows.
In the end you can use these indices to select the correct positions from mat, and return all of that as a matrix. Due to the vectorized nature of this code, you have to check that the last subdiagonals are correct: these contain fewer elements than the first, so you have to replace the values that are not part of a given subdiagonal with NA. Here too, you can simply use an indexing trick.
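To make that concrete, here is a quick sanity check on a small 5 x 3 matrix; the output shown in the comments is my own trace of the function above, so verify it on your end:

m <- matrix(1:15, nrow = 5)  # taller than wide, so the NC < NR branch applies
exdiag(m)
#      [,1] [,2] [,3] [,4]
# [1,]    2    3    4    5
# [2,]    8    9   10   NA
# [3,]   14   15   NA   NA
# column k holds the k-th subdiagonal, padded with NA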
You can use it as follows:
> diag1 <- exdiag(amatrix)
> diag2 <- exdiag(t(amatrix))
> identical(diag1, diag2)
[1] TRUE
In order to come to your result:
amatrix <- matrix(1:72, ncol = 6)
diag1 <- exdiag(amatrix)
res <- apply(diag1,2,cumprod)
res[is.na(res)] <- 0
t(res)
You can modify the diag() function.
exdiag <- function(mat, off = 0) {mat[row(mat) == col(mat) + off]}
exdiag2 <- function(matrix, off) {diag(matrix[-1:-off, ])}
Speed Test:
mat <- diag(10, 10000, 10000)
off <- 4
> system.time(exdiag(mat, 4))
   user  system elapsed
  7.083   2.973  10.054
> system.time(exdiag2(mat, 4))
   user  system elapsed
  5.370   0.155   5.524
> system.time(diag(mat))
   user  system elapsed
  0.002   0.000   0.002
It looks like the subsetting from the matrix takes a lot of time, but it still performs better than your implementation. There may be many other subsetting approaches that outperform my solution. :)

How to write for loop when function increases with each iteration?

I am trying to estimate the probability of detecting animals at n.sites sites over multiple observation periods, when animals are removed and detection changes in time and space. It works if I do something like this for 5 observation periods:
for (i in 1:nsites) {
  mu[i,1] <- p[i,1]
  mu[i,2] <- p[i,2]*(1-p[i,1])
  mu[i,3] <- p[i,3]*(1-p[i,1])*(1-p[i,2])
  mu[i,4] <- p[i,4]*(1-p[i,1])*(1-p[i,2])*(1-p[i,3])
  mu[i,5] <- p[i,5]*(1-p[i,1])*(1-p[i,2])*(1-p[i,3])*(1-p[i,4])
}
The probability at time 2 depends on the probability at time 1, and the probability at time 3 depends on the probabilities at times 1 and 2. If I were only doing this for 5 time periods, it wouldn't be a big deal to write it out, but as I get to 10, 15, 20+ time periods it gets quite messy. I feel like there should be a way to write this loop without typing out each step, but I just can't think of how to do it; maybe additional indexing, another control statement, or a power function. If p[i] were the same in each jth observation (i.e. p[i,1] = p[i,2] = p[i,3], etc.) it would be:
p[i]*(1-p[i])^5
Any suggestions would be greatly appreciated.
This is BUGS language code; I work in R and send it to JAGS via the rjags package. BUGS, R, or pseudocode would suit my purposes.
Here is R code that would simulate the problem:
set.seed(123)
nsites <- 108
testp <- matrix(runif(108, 0.1, 0.5), 108, 5)  # note: the 108 values are recycled across the 5 columns
testmu <- matrix(NA, 108, 5)
for (i in 1:nsites) {
  testmu[i,1] <- testp[i,1]
  testmu[i,2] <- testp[i,2]*(1-testp[i,1])
  testmu[i,3] <- testp[i,3]*(1-testp[i,1])*(1-testp[i,2])
  testmu[i,4] <- testp[i,4]*(1-testp[i,1])*(1-testp[i,2])*(1-testp[i,3])
  testmu[i,5] <- testp[i,5]*(1-testp[i,1])*(1-testp[i,2])*(1-testp[i,3])*(1-testp[i,4])
}
Thanks for any help.
Dan
This really does look like a task well suited to R's Reduce:
nsites <- 108
np <- 5
testmu3 <- matrix(NA, nsites, np)
for (i in 1:nsites) {
  testmu3[i, ] <- Reduce(function(x, y) x*(1-y), testp[i, ],
                         accumulate = TRUE)
}
max(abs(testmu3-testmu))
[1] 0
The accumulate parameter creates a growing vector of intermediate results. (One caveat: the exact match with testmu relies on testp holding the same value in every column of a row, a side effect of runif(108) being recycled in the simulated data; with column-varying p you would need the shifted cumprod construction used in the other answers.)
> testp[1, ]
[1] 0.215031 0.215031 0.215031 0.215031 0.215031
> Reduce( function(x,y) x*(1-y), testp[1, ], accumulate=TRUE)
[1] 0.21503101 0.16879267 0.13249701 0.10400605 0.08164152
@Frank's answer is cleaner (and probably faster), but this will also work and might be a little easier to understand.
nsites <- 108
np <- 5
testmu2 <- matrix(NA, nsites, np)
for (i in 1:nsites) {
  fac <- 1
  testmu2[i,1] <- testp[i,1]
  for (j in 2:np) {
    fac <- fac * (1-testp[i,j-1])
    testmu2[i,j] <- testp[i,j] * fac
  }
}
max(abs(testmu2-testmu))
[1] 2.775558e-17
Here's one way:
testmu2 <- testp*t(apply(cbind(1,1-testp[,-5]),1,cumprod))
On my computer, they almost match:
> max(abs(testmu2-testmu))
[1] 2.775558e-17
I don't know about BUGS/JAGS, but the idea is to take the cumulative product of your 1-p matrix across its columns, shifted right by one with a leading column of ones, and then multiply element-wise by p.
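Broken into steps, the one-liner amounts to something like this (the intermediate names are just for illustration):

shifted <- cbind(1, 1 - testp[, -5])    # 1 - p shifted right one column; the leading
                                        # column of ones leaves column 1 as p[i,1]
surv <- t(apply(shifted, 1, cumprod))   # row-wise cumulative product: the probability
                                        # of not being detected in periods 1..j-1
testmu2 <- testp * surv                 # detected at j AND missed at every earlier period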

Splitting a data set using two parameters and saving the sub-data sets in a list

I am trying to split my data set using two parameters, the fraction of missing values and "maf", and store the sub-data sets in a list. Here is what I have done (it's not working). Any help will be appreciated.
Thanks.
library(BLR)
library(missForest)
data(wheat)
X2 <- prodNA(X, 0.4)  # creating missing values
dim(X2)
fd <- t(X2)
MAF <- function(geno) {  # markers are in the rows
  geno[(geno != 0) & (geno != 1) & (geno != -1)] <- NA
  geno <- as.matrix(geno)
  # count the alleles
  n0 <- apply(geno == 0, 1, sum, na.rm = TRUE)
  n1 <- apply(geno == 1, 1, sum, na.rm = TRUE)
  n2 <- apply(geno == -1, 1, sum, na.rm = TRUE)
  n <- n0 + n1 + n2
  # calculate allele frequencies
  p <- ((2 * n0) + n1) / (2 * n)
  q <- 1 - p
  maf <- pmin(p, q)
  maf
}
frac.missing <- apply(fd, 1, function(z) {length(which(is.na(z))) / length(z)})
maf <- MAF(fd)
lst <- matrix()
for (i in seq(0.2, 0.7, by = 0.2)) {
  for (j in seq(0, 0.2, by = 0.005)) {
    lst = fd[(maf > j) | (frac.missing < i), ]
  }
}
It sounds like you want the results that the split function provides.
If "frac.missing" and "maf" are vectors defined on the basis of the values in "fd" (with length equal to the number of rows in "fd"), then this would provide the split you are looking for:
spl.fd <- split(fd, list(maf, frac.missing))
If you want to "group" the fd values basesd on of maf(fd) and frac.missing within the bands specified by your for-loop, then the same split-construct may do what your current code is failing to accomplish:
lst <- split(fd, list(cut(MAF(fd), breaks = seq(0, 0.2, by = 0.005),
                          include.lowest = TRUE),
                      cut(frac.missing, breaks = seq(0.2, 0.7, by = 0.2),
                          right = TRUE, include.lowest = TRUE)))
The right argument accommodates the desire to have the splits based on a "<" operator, whereas the default operation of cut presumes a ">" comparison against the 'breaks'. The other function that provides a similar facility is by.
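If the effect of right is unclear, here is a toy illustration (made-up numbers) of how it changes the handling of values that fall exactly on a break:

x <- c(0.2, 0.4)
cut(x, breaks = c(0, 0.2, 0.4), right = TRUE)
# [1] (0,0.2]   (0.2,0.4]    -- boundary values fall into the lower interval
cut(x, breaks = c(0, 0.2, 0.4), right = FALSE)
# [1] [0.2,0.4) <NA>         -- 0.4 now falls outside the last interval [0.2,0.4)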
The code below gives me exactly what I need:
Y <- t(GBS.binary)
nn <- colnames(Y)
fd <- Y
maf <- as.matrix(MAF(Y))
dff <- cbind(frac.missing, maf, Y)
colnames(dff) <- c("fm", "maf", nn)
dff <- as.data.frame(dff)
for (i in seq(0.1, 0.6, by = 0.1)) {
  for (j in seq(0, 0.2, by = 0.005)) {
    assign(paste("fm_", i, "maf_", j, sep = ""),
           (subset(dff, maf > j & fm < i))[, -c(1, 2)])
  }
}

mapply for row cor.test function

I am trying to use cor.test over the rows of two matrices, namely cer and par.
cerParCorTest <-mapply(function(x,y)cor.test(x,y),cer,par)
mapply, however, works on columns.
This issue has been discussed in Efficient apply or mapply for multiple matrix arguments by row. I tried that split solution (as below):
cer <- split(cer, row(cer))
par <- split(par, row(par))
and it results in this warning (plus it is slow):
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
data length is not a multiple of split variable
I also tried t(par) and t(cer) to get it to run over the rows, but that results in the error:
Error in cor.test.default(x, y) : not enough finite observations
The matrices are shown below (this is cer; par has the same form):
V1698 V1699 V1700 V1701
YAL002W(cer) 0.01860500 0.01947700 0.02043300 0.0214740
YAL003W(cer) 0.07001600 0.06943900 0.06891200 0.0684330
YAL005C(cer) 0.02298100 0.02391900 0.02485800 0.0257970
YAL007C(cer) -0.00026047 -0.00026009 -0.00026023 -0.0002607
YAL008W(cer) 0.00196200 0.00177360 0.00159490 0.0014258
My question is why transposing the matrices does not work, and what short solution will allow running cor.test() over the rows with mapply.
I apologise for the long post and thanks in advance for any help.
I don't know what the dimensions of your matrices are, but this works fine for me:
N <- 3751 * 1900
cer.m <- matrix(1:N, ncol = 1900)
par.m <- matrix(1:N + rnorm(N), ncol = 1900)
ll <- mapply(cor.test,
             split(par.m, row(par.m)),
             split(cer.m, row(cer.m)),
             SIMPLIFY = FALSE)
This will give you a list of 3751 elements (the correlation test for each row).
EDIT: without split, you give the index of the row; this should be fast:
ll <- mapply(function(x, y) cor.test(cer.m[x, ], par.m[y, ]),
             1:nrow(cer.m),
             1:nrow(cer.m),
             SIMPLIFY = FALSE)
EDIT 2: to get the estimate value, for example:
sapply(ll, '[[', 'estimate')
You could always just program things in a for loop; it seems reasonably fast at these dimensions:
x1 <- matrix(rnorm(10000000), nrow = 2000)
x2 <- matrix(rnorm(10000000), nrow = 2000)
out <- vector("list", nrow(x1))
system.time(
  for (j in seq_along(out)) {
    out[[j]] <- cor.test(x1[j, ], x2[j, ])
  }
)
   user  system elapsed
   1.35    0.00    1.36
EDIT: If you only want the estimate, I wouldn't store the results in a list but in a numeric vector:
out2 <- vector("numeric", nrow(x1))
for (j in seq_along(out)) {
out2[j] <- cor.test(x1[j, ], x2[j, ])$estimate
}
head(out2)
If you want to store all the results and simply extract the estimate from each, then this should do the trick:
> out3 <- as.numeric(sapply(out, "[", "estimate"))
#Confirm they are the same
> all.equal(out2, out3)
[1] TRUE
The tradeoff is that the first method stores all the results in a list, which may be useful for further processing, versus a simpler method that only grabs what you initially want.
