Extend conditions in a dynamic way - r

I am trying to build a decision table. At time 3 for example I have to take the previous results in time t=1 and time t=2 in order to make my decision in time 3. The decision table is going to be pretty big so I am considering an efficient way to do it by building a function. For instance at time 3:
rm(list=ls()) # clear memory
names <- c("a","b","c","d","e","f","g","h","i","j","k50","l50","m50","n50","o50")
proba <- c(1,1,1,1,1,1,1,1,1,1,0.5,0.5,0.5,0.5,0.5)
need <- 4
re <- 0.5
w <- 1000000000
# t1
t1 <- as.integer(names %in% (sample(names,need,prob=proba,replace=F)))
# t2
t2 <- rep(t1)
# t3
proba3 <- ifelse(t2==1,proba*re,proba)
t3 <- as.integer(names %in% (sample(names,need,prob=proba3,replace=F)))
The table keeps growing until t=7, where proba7 depends on the outcomes from t=1 to t=6. After t=7, each step always uses the 6 previous outcomes plus the random part proba to make the decision. In other words, the ifelse must be dynamic so that I can call it for any time step. I have been trying something like
probF <- function(a){
test <- ifelse(paste0("t",a,sep="")==1,proba*re,proba)
return(test)
}
test <- probF(2)
but this does not work: I get just one value and not a vector. I know that it looks complicated.
For the conditions requested by one person (I know it is not very well written):
proba7 <- ifelse(t2==1 & t3==1 & t4==0 & t5==0 & t6==0,proba,
ifelse(t2==1 & t3==0 & t4==0 & t5==1 & t6==1,proba*re,
ifelse(t2==1 & t3==0 & t4==0 & t5==0 & t6==1, w,
ifelse(t2==0 & t3==1 & t4==1 & t5==0 & t6==0,proba,
ifelse(t2==0 & t3==1 & t4==1 & t5==1 & t6==0,0,
ifelse(t2==0 & t3==0 & t4==1 & t5==1 & t6==1,0,
ifelse(t2==0 & t3==0 & t4==1 & t5==1 &t6==0,0,
ifelse(t2==0 & t3==0 & t4==0 & t5==1 & t6==1, proba*re,
ifelse(t2==0 & t3==0 & t4==0 & t5==0 & t6==1,w,proba)))))))))
t7 <- as.integer(names %in% (sample(names,need,prob=proba7,replace=F)))

If you take a bit of a different approach, you'll gain quite a lot of speed.
First of all, it is really a terribly bad idea to store every step as a separate t1, proba1, etc. If you need to keep all that information, predefine a matrix or list of the right size and store everything in there. That way you can use simple indices instead of having to resort to the bug-prone use of get(). If you find yourself typing get(), almost always it's time to stop and rethink your solution.
Secondly, you can use a simple principle to select the indices of the test t:
seq(max(1, i-6), i-1)
will allow you to use a loop index i and refer to the (at most) 6 previous positions, as far as they exist.
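A small sketch of that index window, clamped at 1 because R indices start at 1. The helper name window_ids and its argument k are made up for illustration:

```r
# Hypothetical helper: indices of the (at most) k steps before step i
window_ids <- function(i, k = 6) seq(max(1, i - k), i - 1)

early <- window_ids(3)   # only two earlier steps exist: 1 2
late  <- window_ids(10)  # the full window of six: 4 5 6 7 8 9
```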
Thirdly, depending on what you want, you can reformulate your decision as well. If you store every t as a row in the matrix, you can simply use colSums() on the previous rows and check whether the result is larger than 0. Based on that index, you can update the probabilities in such a way that a 1 in any of the previous 6 rows halves the probability.
Wrapping everything in a function would then look like:
myfun <- function(names, proba, need, re,
w=100){
# For convenience, so I don't have to type this twice
resample <- function(p){
as.integer(
names %in% sample(names,need,prob=p, replace = FALSE)
)
}
# get the number of needed columns
nnames <- length(names)
# create two matrices to store all the t-steps and the probabilities used
theT <- matrix(nrow = w, ncol = nnames)
theproba <- matrix(nrow = w, ncol = nnames)
# Create a first step, using the original probabilities
theT[1,] <- resample(proba)
theproba[1,] <- proba
# loop over the other simulations, each time checking the condition
# recalculating the probability and storing the result in the next
# row of the matrices
for(i in 2:w){
# the id vector to select the (maximal) 6 previous rows. If
# i-6 is smaller than 1 (i.e. there are no 6 steps yet), the
# max(1, i-6) guarantees that you start minimal at 1.
tid <- seq(max(1, i-6), i-1)
# Create the probability vector from the original one
p <- proba
# look for which columns in the 6 previous steps contain a 1
pid <- colSums(theT[tid,,drop = FALSE]) > 0
# update the probability vector
p[pid] <- p[pid]*re
# store the next step and the used probabilities in the matrices
theT[i,] <- resample(p)
theproba[i,] <- p
}
# Return both matrices in a single list for convenience
return(list(decisions = theT,
proba = theproba)
)
}
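which can be used as follows (inside myfun, w is the number of time steps to simulate, not the weight w from the question):
myres <- myfun(names, proba, need, re, w = 100)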
head(myres$decisions)
head(myres$proba)
This returns you a matrix where every row is one t-point in the decision table.
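If you need the explicit pattern rules from the proba7 expression in the question rather than the simplified "any 1 in the last 6 steps" rule, a lookup table may be easier to maintain than nested ifelse calls. The following is only a sketch: the names rules and apply_rules and the action labels are invented here, the patterns are copied from the nested ifelse in the question (values of t2 to t6 pasted together), and prev is assumed to hold one row per earlier step and one column per name.

```r
# Pattern of t2..t6 for one name -> action (labels are hypothetical)
rules <- c("11000" = "base",   "10011" = "reduce", "10001" = "weight",
           "01100" = "base",   "01110" = "zero",   "00111" = "zero",
           "00110" = "zero",   "00011" = "reduce", "00001" = "weight")

apply_rules <- function(prev, proba, re, w, rules) {
  # one key per column, e.g. "10011"
  key <- apply(prev, 2, paste, collapse = "")
  action <- unname(rules[key])
  # patterns without a rule fall back to the original probability
  action[is.na(action)] <- "base"
  out <- proba
  reduce <- action == "reduce"
  out[reduce] <- proba[reduce] * re
  out[action == "weight"] <- w
  out[action == "zero"] <- 0
  out
}

# four names with made-up histories; the last one matches no rule
prev <- cbind(c(1, 1, 0, 0, 0), c(1, 0, 0, 1, 1),
              c(0, 0, 0, 0, 1), c(1, 1, 1, 1, 1))
p <- apply_rules(prev, proba = c(1, 1, 0.5, 1), re = 0.5, w = 100, rules = rules)
```

Inside the loop of myfun, prev would be theT[tid, , drop = FALSE] restricted to the rows the rule table expects.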

Related

Quickest way to change DNA character string based on quality information (both imported from BAM file) in R

I am attempting to alter the characters in a DNA character string based on the associated quality information. There are a number of tools I have seen for quality filtering (cutadapt, trimmomatic among others), but none of them seem to do precisely what I want.
I want to convert the basecall to another value (using "Y" in below script for no particular reason) based on a quality threshold. The purpose of this is to exclude that base from downstream analysis, which depends on the presence of A / T / G / C / -.
The quality information is imported from the BAM file (using GenomicAlignments + ScanBam) and I am using Rsubread to convert this to Q-scores.
I have the below for loop working fine, but it takes quite a while on large data sets (~100k reads). Are there ways to maintain a similar output but in a more efficient manner, maybe with apply? I don't know how to turn this into a parallel computation (it seems to only use one core).
Dummy data included for completeness.
#Make dummy sequence data frame
seq1 <- "CCGGGGGATCCGGAGGCACCGGCGGCGGGTCCGGCGGCGGCAGTAATACGACTCACTATAGGGGCACCGGTGGACTGTTCGAGGCCATCGAGGGATTCATCGAAAACGGATGGGAAGGCATGATCGACGGCTGGTACGGCTTTAGGTGACACCAGAACGCCCAGGGCGAGGGCACGGCCGCTTAATAGGCGGCCGCGACTCTAGATCATAATCAGCCATACCACAT"
seq1DF <- data.frame(seq1)
seq2 <- "CCGGGGGATCCGGAGGCACCGGCGGCGGGTCCGGCGGCGGCAGTAATACGACTCACTATAGGGGCACCGGTGGACTGTTCGAGGCCATCGAGGGATTCATCGAAAACGGATGGGAAGGCATGATCGACGGCTGGTACGGCTTTAGGTGACACCAGAACGCCCAGGGCGAGGGCACGGCCGCTTAATAGGCGGCCGCGACTCTAGATCATAATCAGCCATACCACAT"
seq2DF <- data.frame(seq2)
seqDF <- as.data.frame(rbind(seq1,seq2))
#Make dummy quality data frame
quality1 <- data.frame(t(c(replicate(220, 40),30:25)))
quality2 <- data.frame(t(c(replicate(220, 40),35:30)))
qualityDF <- as.data.frame(rbind(quality1,quality2))
#Define quality threshold
threshold <- 30
#number of rows in seqDF changes depending on the number of reads
#some reads have slightly longer lengths, so check for the max nchar and go over all sequences to this extent
#the quality information can contain NA where there is no sequencing data, include !is.na to circumvent this
for (x in c(1:nrow(seqDF))){
for (n in c(1:max(nchar(seqDF[,1])))) {
if(qualityDF[x,n] <= threshold && !is.na(qualityDF[x,n])){
substr(seqDF[x,1],n , n) <- "Y"
}
}
}
This should be more efficient. It is unnecessary to loop through the whole string; only the targeted positions (quality <= threshold) need to be replaced by "Y".
# seq to data.frame split as column
df1 <- data.frame(seq1 = strsplit(seq1,"")[[1]])
df2 <- data.frame(seq2 = strsplit(seq2,"")[[1]])
# NA will be skipped
df1[which(quality1 <= threshold),] <- "Y"
df2[which(quality2 <= threshold),] <- "Y"
# collapse from vector to string
seq1_convert = paste(df1$seq1, sep = '', collapse = '')
seq2_convert = paste(df2$seq2, sep = '', collapse = '')
Maybe the following is a little bit faster.
MNC <- max(nchar(seqDF[,1]))
for(i in c(1:nrow(seqDF))) {
for(j in which(!is.na(qualityDF[i,]) & qualityDF[i,] <= threshold)) {
substr(seqDF[i,1], j, j) <- "Y"
}
substr(seqDF[i,1], nchar(seqDF[i,1])+1, MNC)
}
Or you can convert the data frames to matrices.
#Convert to matrix
seqM <- strsplit(seqDF[,1], "", TRUE)
seqM <- simplify2array(lapply(seqM, "[", seq_len(max(lengths(seqM)))))
qualityM <- t(as.matrix(qualityDF))
#overwrite with `Y` where quality is at or below threshold
seqM[!is.na(qualityM) & qualityM <= threshold] <- "Y"
#In case needed - Convert it back to string
apply(seqM, 2, function(x) paste(x[!is.na(x)], collapse=""))
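Another option is to mask each read with its own quality vector, which avoids padding reads of different lengths. This is a sketch on toy data, assuming the reads are a character vector and the qualities a list of numeric vectors; the function name mask_low_quality and the toy values are made up.

```r
# Hypothetical helper: replace bases whose quality is at or below the threshold
mask_low_quality <- function(read, qual, threshold, mask = "Y") {
  bases <- strsplit(read, "", fixed = TRUE)[[1]]
  q <- qual[seq_along(bases)]            # align qualities to the read length
  bases[!is.na(q) & q <= threshold] <- mask
  paste(bases, collapse = "")
}

reads <- c("ACGTAC", "GGCA")
quals <- list(c(40, 40, 30, 25, 40, 10), c(40, 20, NA, 40))

masked <- mapply(mask_low_quality, reads, quals,
                 MoreArgs = list(threshold = 30), USE.NAMES = FALSE)
# masked is c("ACYYAY", "GYCA")
```

Because every read is handled independently, swapping mapply for parallel::mcmapply (on Unix-like systems) is one way to spread the work over several cores.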

Multi Conditional Statements in R

I'd like to know the shape or length of the filtered dataframe through multiple conditions. I have 2 methods I've used, but I'm a little stumped because they're giving me different outputs.
Method 1
x <- df[df$gender=='male',]
x <- x[x$stat == 0,]
nrow(x)
OUTPUT = Some Number
Method 2
nrow(sqldf('SELECT * FROM df WHERE gender == "male" AND stat == 0'))
OUTPUT = Some Number
I'm a little confused as to why the outputs would be different? Any ideas?
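In method one you assign x to df[df$gender=='male',] and then overwrite x with x[x$stat == 0,], so the two filters are applied one after the other. If gender or stat contains NA, this kind of logical indexing keeps a row of NAs for every NA in the condition, and nrow() counts those rows, whereas the SQL WHERE clause drops them; that may be why the numbers differ. Off the top of my head, with no dataset, x <- df[df$gender=='male' & df$stat == 0,] combines both conditions in one step (note the trailing comma for row selection), although it has the same NA behaviour. I would use the subset function, x <- subset(df, gender=='male' & stat == 0), which drops rows where the condition is NA, and then nrow(x).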
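A small made-up data frame shows how NA values can make the counts diverge. This is only an illustration; whether the asker's data actually contains NA is an assumption.

```r
# Toy data (invented): one NA in each column
df <- data.frame(gender = c("male", "male", NA, "female"),
                 stat   = c(0, NA, 0, 0))

# Method 1: an NA in the logical index produces an all-NA row that is kept
x <- df[df$gender == "male", ]
x <- x[x$stat == 0, ]
n_index <- nrow(x)     # 3: one real match plus two all-NA rows

# subset() and which() both drop rows where the condition is NA
n_subset <- nrow(subset(df, gender == "male" & stat == 0))          # 1
n_which  <- nrow(df[which(df$gender == "male" & df$stat == 0), ])   # 1
```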

Vectorizing a for-loop containing an if-else statement in R

For my program, I have to write a script that identifies clusters based on the distance "locuslim", which is the maximum distance allowed. I got a slow for-loop working as desired, but I have some trouble vectorizing it to optimize the runtime.
"s" is the iterator for all strains being compared and "i" is to iterate over rows. Col is to create new columns, each strain getting a new column.
Especially the ifelse seems to be the issue...
The for-loop (as function):
clusterfunc <- function (TABLE11_4){
for (s in levels(Strains)){
message(s) # Just to see the progress while running
locus <- min(TABLE11_4$dist[TABLE11_4$dist>0 & TABLE11_4$Organism.q==s], na.rm=TRUE) # The minimal distance between two items
locuslim <- 3*locus # Maximal distance allowed
col <- paste("Cluster",s,sep=".")
TABLE11_4 <- TABLE11_4 %>% group_by(Organism.h) %>% arrange(Rev.gr) %>% mutate(col=ifelse(lead(dist)<=locuslim,1,2))
TABLE11_4 <- TABLE11_4 %>% group_by() %>% arrange(Rev.gr)
TABLE11_4$dist[is.na(TABLE11_4$dist)] <- 0 # Eliminate NAs by making them 0
c=1
for(i in 1:nrow(TABLE11_4)){
if(TABLE11_4$dist[i]<=locuslim){
TABLE11_4$col[i] <- c # Set clusternumber it belongs to
}
else{
TABLE11_4$col[i] <- c+1 # Set next clusternumber it belongs to
c <- c+1 # Go to next cluster
}
}
setnames(TABLE11_4,"col",col)
}
return(TABLE11_4)
}
Thank you in advance!
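One possible way to vectorise the inner for-loop: the cluster number at row i is 1 plus the number of rows up to and including i whose distance exceeds locuslim, which is a cumulative sum. The sketch below uses a made-up dist vector and a fixed locuslim instead of the TABLE11_4 columns.

```r
# Toy values standing in for TABLE11_4$dist and the computed locuslim
dist <- c(0, 2, 10, 1, 1, 8, NA)
locuslim <- 3

dist[is.na(dist)] <- 0                  # same NA handling as in the loop
cluster <- 1 + cumsum(dist > locuslim)  # a new cluster starts at each large gap
# cluster is 1 1 2 2 2 3 3

# Reference: the original loop logic, for comparison
loop_cluster <- numeric(length(dist))
c_id <- 1
for (i in seq_along(dist)) {
  if (dist[i] > locuslim) c_id <- c_id + 1
  loop_cluster[i] <- c_id
}
```

Inside clusterfunc, the result could be assigned to the per-strain column in one step instead of filling it row by row.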

speeding up boolean logic loop in R

I am very new to R but I am interested in learning more and improving.
I have a dataset with around 40,000+ rows containing the length of neuron segments. I want to compare the length trends of neurons of different groups. The first step in this analysis involves sorting the measurements into 1 of 6 different categories such as '<10' '10-15', '15-20', '20-25', '25-30', and '>30'.
I created these categories as appended columns using 'mutate' from the 'dplyr' package and now I am trying to write a boolean function to determine where the measurement fits by applying a value of '1' to the corresponding column if it fits, and a '0' if it doesn't.
Here is what I wrote:
for (i in 1:40019) {
{if (FinalData$Length[i] <=10)
{FinalData$`<10`[i]<-1
} else {FinalData$`<10`[i]<-0}} #Fills '<10'
if (FinalData$Length[i] >=10 & FinalData$Length[i]<15){
FinalData$`10-15`[i]<-1
} else{FinalData$`10-15`[i]<-0} #Fills'10-15'
if (FinalData$Length[i] >=15 & FinalData$Length[i]<20){
FinalData$`15-20`[i]<-1
} else{FinalData$`15-20`[i]<-0} #Fills '15-20'
if (FinalData$Length[i] >=20 & FinalData$Length[i]<25) {
FinalData$`20-25`[i]<-1
} else{FinalData$`20-25`[i]<-0} #Fills '20-25'
if(FinalData$Length[i] >=25 & FinalData$Length[i]<30){
FinalData$`25-30`[i]<-1
} else{FinalData$`25-30`[i]<-0} #Fills '25-30'
if(FinalData$Length[i] >=30){
FinalData$`>30`[i]<-1
} else{FinalData$`>30`[i]<-0} #Fills '>30'
}
This seems to work, but it takes a long time:
system.time(source('~/Desktop/Home/Programming/R/Boolean Loop R.R'))
user system elapsed
94.408 19.147 118.203
The way I coded this seems very clunky and inefficient. Is there a faster and more efficient way to code something like this or am I doing this appropriately for what I am asking for?
Here is an example of some of the values I am testing:
'Length': 14.362, 12.482337, 8.236, 16.752, 12.045
If I am not being clear about how the dataframe is structured, here is a screenshot:
[Screenshot: how my data frame is organized]
You can use the cut function in R. It is used to convert numeric values to factors:
x<-c(1,2,4,2,3,5,6,5,6,5,8,0,5,5,4,4,3,3,3,5,7,9,0,5,6,7,4,4)
cut(x = x,breaks = c(0,3,6,9,12),labels = c("grp1","grp2","grp3","grp4"),right=F)
Set right = TRUE or right = FALSE as per your need.
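Applied to the sample lengths from the question, this could look as follows. The breaks and labels are chosen here to match the six categories described in the question; right = FALSE makes each interval include its lower bound, so 10 goes to "10-15".

```r
Length <- c(14.362, 12.482337, 8.236, 16.752, 12.045)

breaks <- c(-Inf, 10, 15, 20, 25, 30, Inf)
labels <- c("<10", "10-15", "15-20", "20-25", "25-30", ">30")

# intervals are [lower, upper) because right = FALSE
grp <- cut(Length, breaks = breaks, labels = labels, right = FALSE)
counts <- table(grp)
```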
You can vectorise that as follows (I made some sample data called DF):
DF <- data.frame(1:40000,sample(letters,40000,replace=TRUE),"Length"=sample(1:40,40000,replace=TRUE))
MyFunc <- function(x) {
# write the labels into a separate vector so x stays numeric for every comparison
out <- character(length(x))
out[x < 10] <- "<10"
out[x >= 10 & x < 15] <- "10-15"
out[x >= 15 & x < 20] <- "15-20"
out[x >= 20 & x < 25] <- "20-25"
out[x >= 25 & x < 30] <- "25-30"
out[x >= 30] <- ">30"
return(out)
}
DF$Group <- MyFunc(DF[,3])
If it has to be 6 columns like that, you can modify the above to return a one or zero for the appropriate size and everything else, respectively, for each of the 6 columns.
Edit: I guess a series of ifelse might be best if it really has to be 6 columns like that.
e.g.
# ifelse is already vectorised, so no sapply is needed
DF$'<10' <- ifelse(DF$Length < 10,1,0)
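If all six 0/1 columns are needed, they can be built in one go from vectors of lower and upper bounds. This is a sketch; the bounds follow the categories in the question, and the names lower, upper and ind are made up.

```r
Length <- c(14.362, 12.482337, 8.236, 16.752, 12.045, 31)

lower  <- c(-Inf, 10, 15, 20, 25, 30)
upper  <- c(10, 15, 20, 25, 30, Inf)
labels <- c("<10", "10-15", "15-20", "20-25", "25-30", ">30")

# one column per category; each comparison runs on the whole vector at once
ind <- mapply(function(lo, hi) as.integer(Length >= lo & Length < hi),
              lower, upper)
colnames(ind) <- labels

# check.names = FALSE keeps column names such as "<10" unchanged
FinalData <- data.frame(Length = Length, ind, check.names = FALSE)
```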

Splitting a data set using two parameters and saving the sub-data sets in a list

I am trying to split my data set using two parameters, the fraction of missing values and "maf", and store the sub-data sets in a list. Here is what I have done (it's not working). Any help will be appreciated,
Thanks.
library(BLR)
library(missForest)
data(wheat)
X2<- prodNA(X, 0.4) ### creating missing values
dim(X2)
fd<-t(X2)
MAF<-function(geno){ ## markers are in the rows
geno[(geno!=0) & (geno!=1) & (geno!=-1)] <- NA
geno <- as.matrix(geno)
## calc_Freq for alleles
n0 <- apply(geno==0,1,sum,na.rm=T)
n1 <- apply(geno==1,1,sum,na.rm=T)
n2 <- apply(geno==-1,1,sum,na.rm=T)
n <- n0 + n1 + n2
## calculate allele frequencies
p <- ((2*n0)+n1)/(2*n)
q <- 1 - p
maf <- pmin(p, q)
maf}
frac.missing <- apply(fd,1,function(z){length(which(is.na(z)))/length(z)})
maf<-MAF(fd)
lst<-matrix()
for (i in seq(0.2,0.7,by =0.2)){
for (j in seq(0,0.2,by =0.005)){
lst=fd[(maf>j)|(frac.missing < i),]
}}
It sounds like you want the results that the split function provides.
If you have a vector "frac.missing", and "maf" is defined on the basis of values in "fd" (and has the same length as the number of rows in "fd"), then this would provide the split you are looking for. Note that fd is a matrix, and split() on a matrix splits its individual elements, so convert it to a data frame first to split by rows:
spl.fd <- split(as.data.frame(fd), list(maf, frac.missing) )
If you want to group the fd rows based on maf and frac.missing within the bands specified by your for-loop, then the same split construct may do what your current code is failing to accomplish:
lst <- split( as.data.frame(fd), list(cut(maf, breaks = seq(0,0.2,by =0.005) ,
include.lowest=TRUE),
cut(frac.missing, breaks = seq(0.2,0.7,by =0.2),
right=TRUE,include.lowest=TRUE)
)
)
With right = TRUE (which is also the default), each interval is closed on the right, i.e. (a, b], so a value equal to a break falls into the lower band; use right = FALSE for [a, b) intervals if you need a strict "<" against the upper break. The other function that provides similar facility is by.
The code below gives me exactly what I need:
Y<-t(GBS.binary)
nn<-colnames(Y)
fd<-Y
maf<-as.matrix(MAF(Y))
dff<-cbind(frac.missing,maf,Y)
colnames(dff)<-c("fm","maf",nn)
dff<-as.data.frame(dff)
for (i in seq(0.1,0.6,by=0.1)) {
for (j in seq(0,0.2,by=0.005)){
assign(paste("fm_",i,"maf_",j,sep=""),
(subset(dff, maf>j & fm <i))[,-c(1,2)])
} }
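Since the goal was to keep the sub-data sets in a list, the same thresholds can also be looped over with lapply instead of assign(). This is a sketch on a tiny made-up dff with two marker columns; the names grid and subsets are invented here.

```r
# Toy stand-in for dff: fm, maf, then the marker columns
dff <- data.frame(fm  = c(0.05, 0.15, 0.25, 0.35),
                  maf = c(0.01, 0.10, 0.15, 0.20),
                  m1  = 1:4, m2 = 5:8)

# every combination of the two thresholds (fm varies fastest)
grid <- expand.grid(fm = seq(0.1, 0.6, by = 0.1),
                    maf = seq(0, 0.2, by = 0.005))

subsets <- lapply(seq_len(nrow(grid)), function(k) {
  keep <- dff$maf > grid$maf[k] & dff$fm < grid$fm[k]
  dff[keep, !(names(dff) %in% c("fm", "maf")), drop = FALSE]
})
names(subsets) <- paste0("fm_", grid$fm, "maf_", grid$maf)
```

Each element can then be accessed by position or by name, for example subsets[[1]], without creating hundreds of separate objects in the workspace.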

Resources