R loop until condition is met by changing input data

I have an RNA-seq dataset (rows = samples, columns = genes) which goes into a clustering function. The genes are grouped into n clusters labelled with numbers, where the genes in cluster 0 are the non-clustered ones. The clustered genes go back into the clustering function and we again get n clusters labelled with numbers, with 0 again being the non-clustered genes. The process repeats until no further genes are classified into cluster 0. I need to loop through this process and return the final clustering result along with the union of the genes belonging to cluster 0 at each iteration. I'm aware this could be done with either while or repeat. I tried using repeat but it is not working, and the problem is that I don't really understand how to set this up properly.
# define my dataset
dat <- my_dataset
repeat{
  # run the clustering
  aa <- cluster(dat)
  # if cluster 0 has length 0 (no genes), assign the clustering result to Mod and stop the loop
  if (length(which(aa$colors == 0)) == 0){
    Mod <- aa
    break
  # otherwise, store the genes belonging to cluster 0 and build a new dataset from the clustered genes
  } else {
    noPass <- rownames(dat)[aa$colors == 0]
    dat <- dat[, which(aa$colors != 0)]
  }
  return(list(aa, noPass))
}
Any suggestion is really appreciated.

This sounds like a suitable problem for recursion:
define a function that clusters the elements in its argument; if none are left out of the clustering, return the elements, otherwise call itself using only the elements that are still in, and so on until none are left out.
cluster_until_none_left_out <- function(elements) {
  aa <- cluster(data[elements])
  if (sum(aa$colors == 0) == 0) {
    return(list(aa, elements))
  } else {
    return(cluster_until_none_left_out(elements[aa$colors != 0]))
  }
}
It's a bit difficult to provide a full, running example without any real data, but here's a very simple one using mock data: a vector of random integers, which a mock 'clustering' function splits into those greater than half the mean (the single 'cluster') and those below it (the left-out, 'unclustered' elements).
data <- sample(1:100)
cluster <- function(vec) {
  answer <- data.frame(vec)
  answer$colors <- ifelse(vec > mean(vec) / 2, 1, 0)
  return(answer)
}
sum(cluster(data)$colors == 0)
# shows that one round of clustering leaves some elements in cluster 0
initial_elements <- 1:100
clustering_elements <- cluster_until_none_left_out(initial_elements)[[2]]
sum(cluster(data[clustering_elements])$colors == 0)
# 0 (no elements left in cluster 0)
The final clustering is now returned together with the indices of the elements it uses. The 'left-out' elements (cluster 0) are all the rest.
cluster_0 <- initial_elements[-clustering_elements]
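If you would rather keep the repeat loop from the question, here is a minimal iterative sketch along the same lines. It assumes, as in the question, that cluster() returns an object whose colors vector has one entry per column (gene) of dat; note that with genes in the columns it is colnames(dat), not rownames(dat), that identifies them. The cluster-0 genes are accumulated with union() at every pass:
dat <- my_dataset            # rows = samples, columns = genes
noPass <- character(0)       # union of cluster-0 genes across iterations
repeat {
  aa <- cluster(dat)
  dropped <- colnames(dat)[aa$colors == 0]
  if (length(dropped) == 0) {
    Mod <- aa                # final clustering: no gene left in cluster 0
    break
  }
  noPass <- union(noPass, dropped)
  dat <- dat[, aa$colors != 0, drop = FALSE]
}
result <- list(clustering = Mod, unclustered = noPass)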


How to vectorise distance matrix calculations in R?

I have data that is in the form of a list stored in a string. There are around 7300 rows, and the lists are in the form "23.53.24.33.55" and so on.
Each of these numbers is a code for items and I'm calculating the Sorensen-Dice distance between the itemsets. There are 238 unique item codes.
I start with a dataset matrix_ob.csv which has 7300 rows and 239 columns, where each list as shown above is one-hot encoded into binary values indicating presence or absence of each item in the row (so in the example above there would be 233 zeroes and 5 ones, for 23, 53, 24, 33 and 55). Initially I did this with for loops, which worked when the datasets were smaller, but now I have so many rows that a for loop seems like the wrong approach. The distance matrix at the end has dimensions 7300x7300, which takes significantly longer to compute.
I take the two lists being compared and add their one-hot encoded values, so each of the 238 binary positions can take the value 0, 1 or 2, where 2 means the item code is present in both lists. If the itemsets are identical the distance is 0; if they have nothing in common the distance is 1. The third case is when only some items are present in both itemsets, for instance "23.25.27" and "23.53.24.33.55" have 1 item in common. I count the 0s, 1s and 2s, plug them into the formula for the coefficient, and the Sorensen-Dice distance is simply 1 minus the coefficient.
Currently this is the code I use:
sorensen_dice <- matrix(0, nrow = nrow(matrix_ob), ncol = nrow(matrix_ob))
for (i in 1:nrow(sorensen_dice)) {
  for (j in 1:nrow(sorensen_dice)) {
    if (i == j) {
      sorensen_dice[i, j] <- 0
    } else {
      temp <- matrix_ob[i, ] + matrix_ob[j, ]
      if (length(table(list(temp))) == 2) {
        sorensen_dice[i, j] <- 1
      } else {
        a  <- 2 * (table(list(temp))[[3]])
        bc <- table(list(temp))[[2]]
        p  <- a + bc
        sorensen_dice[i, j] <- 1 - (a / p)
      }
    }
  }
  print(i)
}
colnames(sorensen_dice)<-data_ob$LIST
rownames(sorensen_dice)<-data_ob$LIST
write.csv(sorensen_dice,file="sorensen_dice_ob.csv")
Each of the roughly 7300 outer iterations takes about a minute, which I can see from the print line. However, I don't understand how to vectorise something like this, and there are no built-in functions for this distance measure in R. What would be a good way to vectorise these calculations? I'm essentially running a 7300x7300 nested for loop. I thought I could restrict the loops to one half of the matrix, since it is symmetric, but that feels like a paper-thin fix when I don't properly understand how to vectorise something that is in a nested for loop.
Alternatively, the philentropy package has a distance() function that offers both the sorensen distance and the dice distance as options; which of these is the Sorensen-Dice distance?
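Since the data are already one-hot encoded, one way to vectorise this is to compute all pairwise intersection counts in a single matrix product. A minimal sketch, assuming matrix_ob is a numeric 0/1 matrix with one row per itemset and no all-zero rows:
M <- as.matrix(matrix_ob)               # 7300 x 238 binary matrix
intersections <- M %*% t(M)             # [i, j] = number of items present in both itemsets
sizes <- rowSums(M)                     # number of items in each itemset
size_sums <- outer(sizes, sizes, "+")   # |A| + |B| for every pair
sorensen_dice <- 1 - (2 * intersections) / size_sums
diag(sorensen_dice) <- 0
Each 7300x7300 double matrix takes roughly 400 MB, so this needs a few GB of RAM, but the whole computation is one matrix multiplication plus element-wise operations and should run in seconds rather than hours. You could sanity-check the result against philentropy::distance() on a small subset of rows before deciding which of its methods matches.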

Simulate Steps Through a Markov Chain

I am trying to simulate a step through a Markov chain, with the intent to loop through the procedure multiple times until a condition is met (i.e., to find out how many steps it takes, on average, to reach a specific state).
In this case, a state can only go one way. E.g., State 4 can transition forward to State 5, but cannot transition backward to State 3. This means the lower-left half of the transition matrix is set to zero. This is also why the method below puts arbitrarily large values in the 'prior' states. I attempt to find the correct new state by examining which probability in the specified row of the transition matrix is closest to a random number.
get_new_state <- function(current_state, trans_matrix) {
  # generate a random number between 0 and 1 to compare to the transition matrix probabilities
  rand <- runif(1)
  # transition to where the random number falls within the transition matrix
  row <- current_state  # current state determines the row of trans_matrix
  col <- current_state  # start in the current column
  # holds the value of each potential state to transition to
  potential_states <- rep(0, each = ncol(trans_matrix))
  # in this case, we can't transition to a previous state, so we set the previous state values
  # arbitrarily high so they don't get selected by which.min() later
  potential_states[1:col] <- 999
  # loop through the non-zero matrix values and find the correct state
  for (k in row:ncol(trans_matrix)) {
    if (trans_matrix[row, k] > rand) {
      potential_states[k] <- trans_matrix[row, k] / rand
      potential_states[k] <- 1 - potential_states[k]
    }
  }
  # new state is the index of the lowest value (closest to the random number)
  new_state <- which.min(potential_states)
  return(as.numeric(new_state))
}
I'm not sure if this approach is reasonable. I'm assuming there is a better way to simulate this without the kludge of putting arbitrarily large values in potential_states[].
Would something like this work better? It is a one-line Markov transition:
new_state <- sample(1:number_states, size = 1,
                    prob = transition_matrix[old_state, ])
Just put this in a (for instance) while() loop with a counter.
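For instance, a small sketch of that while() loop with a counter, assuming the rows of transition_matrix sum to 1 and the target state is reachable from the start state:
# count the steps until the chain first reaches target_state
simulate_steps <- function(transition_matrix, start_state, target_state) {
  state <- start_state
  steps <- 0
  while (state != target_state) {
    state <- sample(1:nrow(transition_matrix), size = 1,
                    prob = transition_matrix[state, ])
    steps <- steps + 1
  }
  steps
}
# average number of steps over many runs (the state numbers here are only an illustration)
# mean(replicate(1000, simulate_steps(transition_matrix, 1, 5)))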

Using the rbinom function in R in a matrix

Preface: I am a fairly novice R user; I've used SAS my entire adult life and am not used to working with matrices either.
I am currently working on a project for an evolutionary biology class that requires running the rbinom function through a nested loop over a matrix.
The probability in the first row is set to 0.1, but the values in subsequent rows must use the probability from the previous row. I cannot figure out how to reference the value in the previous row. My code is below; if anybody knows the syntax for this I would greatly appreciate it! Currently I have it set to i-1, but I know that's not right.
# equation = rbinom(1, 2*N, p) / (2*N)
p <- 0.1
N <- 10
T <- 5  # number of generations
L <- 3  # number of independent SNPs
alleles <- matrix(nrow = T, ncol = L)  # initialize a matrix of allele frequencies each generation
alleles[1, ] <- p                      # initialize first row to equal p
for (j in 1:ncol(alleles)) {
  for (i in 2:nrow(alleles)) {
    alleles[i, j] <- (rbinom(1, (2 * N), (i - 1)) / (2 * N))
  }
}
alleles
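If the intent is for each generation to use the allele frequency from the previous generation as the binomial probability, the reference presumably needs to be the previous row of the matrix, alleles[i - 1, j], rather than the literal i - 1. A sketch of that one change, with the same setup as above:
for (j in 1:ncol(alleles)) {
  for (i in 2:nrow(alleles)) {
    # binomial probability = allele frequency from the previous generation
    alleles[i, j] <- rbinom(1, 2 * N, alleles[i - 1, j]) / (2 * N)
  }
}
alleles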

R code for simulation of a setup for picking tiles from a bag

The problem: there is a box with 5 tiles numbered 1, 2, 3, 4, 5. I pick 2 tiles, note the numbers, and drop the tiles back in the bag. Then I pick 2 tiles again and note the numbers. What is the probability that there is no overlap between the numbers? Say I got 1, 4 the first time and 3, 5 the second time: no overlap. The theoretical result is 3/10, but this simulation keeps giving me an answer close to 0.5. Any insights about what I am doing wrong? Could it be the sample function in R?
I make a matrix of all possible pairs you could get with 5 tiles (1,2; 1,3; etc.) and then generate two random numbers which give the row numbers. I take these as the two draws and check whether they overlap.
set.seed(1234)
n <- 10000
count <- 0
t <- cbind(c(1,1,1,1,2,2,2,3,3,4), c(2,3,4,5,3,4,5,4,5,5))
idx <- sample(1:10, 2 * n, replace = TRUE)
i <- idx[1:n]
j <- idx[(n + 1):(2 * n)]
for (ii in 1:n) {
  if ((t[i[ii], 1] != t[j[ii], 1]) && (t[i[ii], 2] != t[j[ii], 2]))
    count <- count + 1
}
count / n
[1] 0.5004
Any insights would be helpful. I am sure the theoretical answer is 3/10.
It's been a while since I've used R, so apologies if I'm a little rusty. It seems to me you're almost there; the problem is the if statement inside the for loop. You're testing whether the first number in the first pair differs from the first number in the second pair AND the second number in the first pair differs from the second number in the second pair. But you're forgetting to check whether the first number in the first pair differs from the second number in the second pair AND the second number in the first pair differs from the first number in the second pair. Here's the full condition:
if (
  (t[i[ii], 1] != t[j[ii], 1]) &&
  (t[i[ii], 2] != t[j[ii], 2]) &&
  (t[i[ii], 1] != t[j[ii], 2]) &&
  (t[i[ii], 2] != t[j[ii], 1])
) count <- count + 1
There might be other ways to accomplish this, but this seems to do the trick. I get about 0.3 for the result. And thanks for the opportunity to think about R again.
I would not use a loop. 10000 observations is not big enough to stop you from building a data.frame with your samples. In the following code I draw the two samples for each of the 10000 trials and put them in a 10000-row by 4-column object, then identify which rows contain a duplicated pick and divide by the total number of trials. The 1 - is there because the code counts the duplicated cases. My result is in line with the theoretical number.
n <- 10000
res <- cbind(t(replicate(n, sample(1:5, 2, replace = FALSE))),
             t(replicate(n, sample(1:5, 2, replace = FALSE))))
1 - sum(apply(apply(res, 1, duplicated), 2, any)) / n
# [1] 0.2979
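Another way to sanity-check the result is to skip the pair-index matrix entirely and sample the tiles directly, counting the draws that share no tile:
set.seed(1234)
n <- 10000
no_overlap <- replicate(n, {
  first  <- sample(1:5, 2)               # first pair of tiles
  second <- sample(1:5, 2)               # tiles were returned, so draw again from all 5
  length(intersect(first, second)) == 0  # TRUE if the two pairs have no tile in common
})
mean(no_overlap)
# should be close to 3/10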

R: How to get a count of a certain value in a matrix row?

Ok I have the following problem:
I have several ranks in a matrix in R (I got these by ranking asset returns: ranks >= 3 get an NA, ranks < 3 get the rank number; if some assets share a rank, there are fewer NAs in that row). Here are two example rows of ranks and two example rows of a matrix of returns.
ranks<-matrix(c(1,1,2,NA,NA, 1,2,NA,NA,NA),nrow=2,ncol=5)
returns<-matrix(c(0.3,0.1,-0.5,-0.7,0.2,0.1,0.4,0.05,-0.7,-0.3),nrow=2,ncol=5)
Now if all assets are equally bought for our portfolio, I can calculate the average return with:
Mat.Ret<-returns*ranks
Mean.Ret<-rowMeans(Mat.Ret,na.rm=TRUE)
However, I want the option of supplying a vector of weights for the two ranks, where the weights say what percentage of the portfolio each rank should get. As an example, take a vector of
weights<-c(0.7,0.3)
Now how would I use this in my code? I basically want to calculate ranks * returns * weights. If there is only ONE rank 1 and ONE rank 2 in the table, the code works. But how do I make this flexible? One solution would be to count, for each rank, how many times it occurs in a particular row, divide the weight by this count, and then multiply this "net weight" * rank * returns.
But I have no clue how to do this in code... any help?
UPDATE AFTER FIRST COMMENTS
Ok, I want to keep it flexible and adjust the weights depending on how many times a certain rank occurs. A user can choose the top 5 ranked assets, so none or several assets may share ranks, which means the distribution of weights must be very flexible. I've written a function which doesn't work yet, since I'm obviously not yet experienced enough with the matrix and vector selection syntax. This is what I have so far:
ranks <- apply(ranks, 1, function(x) distributeWeightsPerMatrixRow(x, weights))

distributeWeightsPerMatrixRow <- function(MatrixRow, Weights){
  if (length(Weights) == length(MatrixRow[!is.na(MatrixRow)])){
    MatrixRow <- Weights[MatrixRow]
  } else {
    for (i in 1:length(MatrixRow)){
      if (!is.na(MatrixRow[i])){
        EqWeights <- length(MatrixRow[MatrixRow == MatrixRow[i]])
        MatrixRow[i] <- sum(Weights[MatrixRow[i]:(MatrixRow[i] + EqWeights - 1)]) / EqWeights
      }
    }
  }
  return(MatrixRow)
}
EDIT2:
The function seems to work; however, the resulting ranks object is the transposed version of the original matrix, without the column names.
Since your ranks are integers above zero, you can use the ranks matrix to index the weights vector:
mat.weights <- weights[ranks]
mat.weighted.ret <- returns * ranks * mat.weights
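For example, with the ranks and weights from the question, weights[ranks] walks through the matrix in column-major order and returns the matching weight for every rank (NA indices stay NA), so the element-wise multiplication with returns * ranks lines up:
weights <- c(0.7, 0.3)
ranks <- matrix(c(1, 1, 2, NA, NA, 1, 2, NA, NA, NA), nrow = 2, ncol = 5)
weights[ranks]
# [1] 0.7 0.7 0.3  NA  NA 0.7 0.3  NA  NA  NA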
Update based on comment.
I suppose you're looking for something like this:
if (length(unique(na.omit(as.vector(ranks)))) == 1) {
  mat.weights <- (!is.na(ranks)) * 0.5
} else {
  mat.weights <- weights[ranks]
}
mat.weighted.ret <- returns * ranks * mat.weights
If there is only one rank, all weights become 0.5.
