For loops in R can be extremely slow, but I don't know of any other way to achieve the following.
As shown in this screenshot:
What I want the output format to look like:
> gene_id tss_id x y
where x = isosub$q1_FPKM / iso.agg$q1_FPKM (using the iso.agg row with the corresponding gene_id)
and y = isosub$q2_FPKM / iso.agg$q2_FPKM
Here is my code with the for loop:
length = length(isosub$gene_id)
tmp = data.frame(isosub$gene_id, isosub$q1_FPKM, isosub$q2_FPKM)
j = 1
denominator_q1 = iso.agg$q1_FPKM[j]
denominator_q2 = iso.agg$q2_FPKM[j]
gene_id = isosub$gene_id
tmpq1 = tmp$isosub.q1_FPKM
tmpq2 = tmp$isosub.q2_FPKM
isoq1 = iso.agg$q1_FPKM
isoq2 = iso.agg$q2_FPKM
o2_q1 = rep(0, length)
o2_q2 = rep(0, length)
i = 0
for (i in 1:length){
if (gene_id[i+1] == gene_id[i]){
o2_q1[i] = tmpq1[i] / denominator_q1
o2_q2[i] = tmpq2[i] / denominator_q2
}else{
o2_q1[i] = tmpq1[i] / denominator_q1
o2_q2[i] = tmpq2[i] / denominator_q2
j = j + 1
denominator_q1 = isoq1[j]
denominator_q2 = isoq2[j]
}
}
When length = 1000, system.time shows:
> user system elapsed
> 55.74 0.00 56.45
And my actual length is even larger: 13751.
Do you want to do a merge?
outdf <- merge(isosub[c("gene_id", "tss_id", "q1_FPKM", "q2_FPKM")],
iso.agg[c("gene_id", "q1_FPKM", "q2_FPKM")],
by="gene_id",
suffixes=c(".1", ".2"))
outdf$x <- outdf$q1_FPKM.1 / outdf$q1_FPKM.2
outdf$y <- outdf$q2_FPKM.1 / outdf$q2_FPKM.2
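A minimal, self-contained sketch of the merge approach on made-up data. The two toy data frames only mimic the column names from the question; their values are invented for illustration, and `result` is a name chosen here.

```r
# Toy stand-ins for the question's data frames (values are invented).
isosub <- data.frame(
  gene_id = c("g1", "g1", "g2"),
  tss_id  = c("t1", "t2", "t3"),
  q1_FPKM = c(2, 6, 5),
  q2_FPKM = c(1, 3, 10)
)
iso.agg <- data.frame(
  gene_id = c("g1", "g2"),
  q1_FPKM = c(8, 5),
  q2_FPKM = c(4, 10)
)

# Join each isoform row to its per-gene total, then divide column-wise.
outdf <- merge(isosub, iso.agg, by = "gene_id", suffixes = c(".1", ".2"))
outdf$x <- outdf$q1_FPKM.1 / outdf$q1_FPKM.2
outdf$y <- outdf$q2_FPKM.1 / outdf$q2_FPKM.2

# Keep only the columns asked for in the question.
result <- outdf[c("gene_id", "tss_id", "x", "y")]
print(result)
```

If the join itself is not needed, the per-gene denominators can also be looked up directly with iso.agg$q1_FPKM[match(isosub$gene_id, iso.agg$gene_id)], which keeps the original row order of isosub.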
If you ended up here looking for ways to avoid or speed up loops, check out this answer:
Speed up the loop operation in R
It helped me with a similar problem I was having, and shows ways to keep necessary loops but increase performance dramatically.
I want to draw a sample of n = 30 from a given normal distribution and calculate the mean of each sample (up to this step my code works without any problem). After that, I want to create a new vector containing "Yes" or "No", depending on whether the mean lies in a certain range. Sadly, the code does not carry out this step correctly: I always get a vector with 13 elements, but there should be 500. What is the problem? Where is my mistake?
o = 13
u = 7
d = c()
for (i in 1:500){
i = rnorm(30,mean = 10,sd = 6.04)
i = mean(i)
if (i <= o & i >=u) {
d[i]=("Yes")
} else {
d[i]=("No")
}
}
You should avoid changing the value of your iterator (i) within your loop. In your case, your i is becoming a non-integer value. When you try to index your d vector, it takes the integer portion of i.
Consider what happens when I have a vector
x <- 1:4
and I take the pi index of it.
x[pi]
# [1] 3
Your code should look more like this:
o = 13
u = 7
d = c()
for (i in 1:500){
sample_i = rnorm(30, mean = 10, sd = 6.04)
mean_i = mean(sample_i)
if (mean_i <= o & mean_i >=u) {
d[i]=("Yes")
} else {
d[i]=("No")
}
}
If you would like to improve your code some, here are some suggestions:
First, avoid "growing" your results. This has performance implications. It is better to decide how long your result (d) should be and set it to that length to begin with.
Next, try not to hard code the number of iterations into your loop. Get familiar with seq_along and seq_len and use them to count iterations for you.
o = 13
u = 7
d = character(500) # I made a change here (character, because d will hold "Yes"/"No")
for (i in seq_along(d)){ # And I made a change here
sample_i = rnorm(30, mean = 10, sd = 6.04)
mean_i = mean(sample_i)
if (mean_i <= o & mean_i >=u) {
d[i]=("Yes")
} else {
d[i]=("No")
}
}
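The same result can also be produced without an explicit loop. This is a minimal base R sketch, not the only way to do it; `n_trials` and `means` are names chosen here for illustration, and the seed is set only so the sketch is reproducible.

```r
set.seed(1)  # only so the sketch is reproducible
o <- 13
u <- 7
n_trials <- 500

# One mean per trial; replicate() returns a numeric vector of length n_trials.
means <- replicate(n_trials, mean(rnorm(30, mean = 10, sd = 6.04)))

# Classify all means at once instead of branching inside a loop.
d <- ifelse(means >= u & means <= o, "Yes", "No")

length(d)
table(d)
```

Because the iterator is never reused as a data value here, the indexing problem from the question cannot occur.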
Re-assigning i looks like a bad idea to me.
Are you sure you want to do this in a for loop? If not, a vectorised solution with crossing (from the tidyverse; there are nice explanations at varianceexplained.org) should work nicely, I think:
library(dplyr) # mutate, group_by, summarise, case_when, count, %>%
library(tidyr) # crossing
o = 13
u = 7
crossing(trial = 1:500,
rounds = 1:30)%>%
mutate(num = rnorm(n(), mean = 10, sd = 6.04))%>%
group_by(trial)%>%
summarise(mean = mean(num))%>%
mutate(d = case_when(mean <= o & mean >= u ~ "Yes",
TRUE ~ "No"))%>%
count(d)
rm(list=ls())
library(gld)
library(GLDEX)
find_likelihood = function(x){
fit = fit.fkml(x,"ML",return.data=TRUE)
params = fit$optim.results$par
dens = dgl(x, c(params), param = "fmkl", inverse.eps = 1e-08, max.iterations = 500)
dens[dens < 1e-12] = 1e-12
value = -sum(log(dens))
return(value)
}
loglik_com = NULL
min_value = NULL
for(j in 1: 1000){
x = rgl(100,c(2, 1, 1, 1), param="fkml")
for (i in 2:(100 - 2)) {
a = x[1:i]
b = x[(i + 1):100]
loglik_com[i] = find_likelihood(a) + find_likelihood(b)
}
min_value[j] = min(na.omit(loglik_com))
}
min_value
The R code above runs 1000 iterations, but it takes nearly 7 hours to produce results. Can I use the apply() function to speed it up? Or is there any other way to make this code run faster?
Thank you in advance.
You could try some form of parallelisation; there is the parallel package. I have no personal experience with it, however.
https://www.rdocumentation.org/packages/parallel/versions/3.6.2
https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/parallel.html
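A minimal sketch of what that could look like with parLapply from the parallel package. `slow_task` is a hypothetical stand-in for one iteration of the outer loop, and two workers is an arbitrary choice; neither comes from the question.

```r
library(parallel)

# Hypothetical stand-in for one expensive, independent iteration of the outer loop.
slow_task <- function(j) {
  sum(sqrt(seq_len(1e5))) + j
}

cl <- makeCluster(2)                      # two worker processes; adjust to your machine
results <- parLapply(cl, 1:8, slow_task)  # each element is handled by one of the workers
stopCluster(cl)                           # always release the workers when done

min_value <- unlist(results)
```

For the real code, the workers would also need the packages and helper functions, for example via clusterEvalQ(cl, { library(gld); library(GLDEX) }) and clusterExport(cl, "find_likelihood"), and clusterSetRNGStream(cl) can be used to give each worker its own reproducible random number stream.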
I am trying to make a simple "model" of loss of an initial value based on a number of criteria, as seen in my code below:
mtDNAlen = 16299
copies = 10
mito = 50
fraction = copies*mtDNAlen*0.75
Rounds = 40
A = data.frame(
Length = c(rep(mtDNAlen,copies))
)
data = list()
for (i in 1:mito){
data[[i]]=list()
}
for (i in 1:length(data)){
for (j in 1:(Rounds+1)){
data[[i]][[j]]=data.frame(A)
}}
finaldata = data.frame(X = seq(1,copies,1))
random.sample = function(x) {
x = sample(sample.data, copies,
prob=Prob,
replace=FALSE)
if (sum(x) > fraction) return(x)
Recall(x)
}
for (i in 1:length(data)){
for(j in 1:Rounds){
data[[i]][[j]]$Deletion = sample(c("Yes","No"), nrow(data[[i]][[j]]), prob=c(0.05,0.95), replace=TRUE)
data[[i]][[j]]$DelLength = ifelse(data[[i]][[j]]$Deletion == "Yes", sample(seq(0,15000,1), replace = TRUE),0)
data[[i]][[j]]$Length = data[[i]][[j]]$Length - data[[i]][[j]]$DelLength
data[[i]][[j]] = data[[i]][[j]][data[[i]][[j]]$Length > 2000,]
data[[i]][[j]] = rbind(data[[i]][[j]],data[[i]][[j]])
Prob = c(rep(16300,nrow(data[[i]][[j]]))) - data[[i]][[j]]$Length
Prob = Prob / sum(Prob)
sample.data = c(data[[i]][[j]]$Length)
data[[i]][[j+1]]$Length = random.sample(sample.data)
finaldata[[i]] = data[[i]][[j+1]]$Length
}
}
However, when I try to run my code I get the following error:
Error: C stack usage 7969700 is too close to the limit
After searching, this seems to occur when using recursive functions. However, I am new to these types of functions, and I don't know how to optimise my code further in order to get rid of this error.
I should mention, that when I run my code without prob=Prob in the random.sample function, there is no error. So I guess the error comes from having to redo the random.sample so many times? Can I do this in a better way to avoid the C stack error? I haven't been able to find an alternative myself.
Finally, if I set Rounds and mito to very small values, I can complete the calculation, but it is not really usable for me...
Thanks!
EDIT
I tried to switch to using repeat instead, making my for loop look like this:
for (i in 1:length(data)){
for(j in 1:Rounds){
data[[i]][[j]]$Deletion = sample(c("Yes","No"), nrow(data[[i]][[j]]), prob=c(0.05,0.95), replace=TRUE)
data[[i]][[j]]$DelLength = ifelse(data[[i]][[j]]$Deletion == "Yes", sample(seq(0,15000,1), replace = TRUE),0)
data[[i]][[j]]$Length = data[[i]][[j]]$Length - data[[i]][[j]]$DelLength
data[[i]][[j]] = data[[i]][[j]][data[[i]][[j]]$Length > 2000,]
data[[i]][[j]] = rbind(data[[i]][[j]],data[[i]][[j]])
Prob = c(rep(16300,nrow(data[[i]][[j]]))) - data[[i]][[j]]$Length
Prob = Prob / sum(Prob)
sample.data = c(data[[i]][[j]]$Length)
repeat {
v2 <- sample(sample.data, copies,
prob=Prob,
replace=FALSE)
if( sum(v2) > fraction )
break
}
return(v2)
data[[i]][[j+1]]$Length = v2
finaldata[[i]] = data[[i]][[j+1]]$Length
}
}
However, now I can't get the sampled data to go to the next dataframe, ie. the line data[[i]][[j+1]]$Length = v2 seems to not be working. I can see v2 is getting generated and it looks to have the appropriate form and data stored...
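One likely cause: in R, return() is only valid inside a function. Called at top level inside the for loop, it raises an error, so the assignment on the following line is never reached. A minimal sketch of the rejection loop wrapped in a function instead of recursion; `sample_until` and its `max_tries` safeguard are hypothetical names introduced here, and the input values are invented.

```r
# Hypothetical helper: redraw until the sampled lengths sum to more than `threshold`.
# `max_tries` is an added safeguard (not in the original code) against looping forever.
sample_until <- function(values, size, prob, threshold, max_tries = 10000) {
  for (attempt in seq_len(max_tries)) {
    draw <- sample(values, size, prob = prob, replace = FALSE)
    if (sum(draw) > threshold) {
      return(draw)  # return() is valid here because we are inside a function
    }
  }
  stop("no sample exceeded the threshold after ", max_tries, " tries")
}

set.seed(42)  # only for reproducibility of the sketch
values <- c(16299, 16299, 12000, 9000, 16299, 15000)  # invented lengths
prob   <- (16300 - values) / sum(16300 - values)      # same weighting idea as in the question
v2     <- sample_until(values, size = 3, prob = prob, threshold = 30000)
v2
```

Inside the nested loops this would be used as data[[i]][[j+1]]$Length = sample_until(sample.data, copies, Prob, fraction), with no return() at top level. A loop uses constant stack depth, so it also avoids the C stack error that deep recursion through Recall() can trigger.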
Suppose I have repeated simulated data (100 data sets, sometimes 1000), and I want to apply a function to each of them. Because there are so many, I would like to know which data set my code is working on at any moment. That is, when the code starts on the first data set it should tell me so, then the same for the second, and so on. I know the results will be printed in the console as a numbered list. However, my real function is much more complicated; this is just a simple example to explain my problem.
This is my code:
N.a = 186; N.b = 38; N.ab=13; N.o = 284
## 1) numerical optimization
llk = function(xpar){
tmp = exp(c(xpar,0))
pr = tmp/sum(tmp) ## A/B/O
res1 = N.a*log(pr[1]^2+2*pr[1]*pr[3]) + N.b*log(pr[2]^2+2*pr[2]*pr[3])
res2 = N.ab*log(2*pr[1]*pr[2]) + N.o*log(pr[3]^2)
-res1-res2
}
pr = rep(1/3,3) ## A/B/O
it = 0; pdiff = 1
while( (it<100)&(pdiff>1e-5) ){
tmp = c(pr[1]^2, 2*pr[1]*pr[3])
tmp = tmp/sum(tmp)
N.aa = N.a*tmp[1]
N.ao = N.a*tmp[2]
tmp = c(pr[2]^2, 2*pr[2]*pr[3])
tmp = tmp/sum(tmp)
N.bb = N.b*tmp[1]
N.bo = N.b*tmp[2]
pr1 = c(2*N.aa+N.ao+N.ab, 2*N.bb+N.bo+N.ab, N.ao+N.bo+2*N.o)
pr1 = pr1/sum(pr1)
pdiff = mean(abs(pr1-pr))
it = it+1
pr = pr1
cat(it, pr, "\n")
}
How can I use the cat function for this? For example, how would I use the following in my code:
cat(paste0("data: ", i, "\n"))
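A minimal sketch of where such a cat() call could go, assuming the simulated data sets are stored in a list. `datasets` and the mean() call are hypothetical placeholders for the real data and the real analysis.

```r
set.seed(123)  # only for reproducibility
# Hypothetical stand-in for the repeated simulated data: a list of 5 data sets.
datasets <- replicate(5, rnorm(20), simplify = FALSE)

results <- vector("list", length(datasets))  # preallocate the output
for (i in seq_along(datasets)) {
  # Progress line, printed before the (possibly slow) work on data set i starts.
  cat(paste0("data: ", i, " of ", length(datasets), "\n"))
  results[[i]] <- mean(datasets[[i]])  # replace with the real analysis
}
```

In some GUIs the console output is buffered; calling flush.console() right after cat() may be needed there to make the line appear immediately.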
Just a quick question for anyone who might know.
I have the following code in R pricing an option:
X = 1.05
r = .85
n = 250
nsim = 2000
ctot = 0
for( i in 1:nsim){ # beginning of loop
u1=rnorm(n,0,1)
u2=rnorm(n,0,1)
x=u1
y=r*u1+sqrt(1-r*r)*u2
x=0.25/sqrt(250)*x + (0.03-0.5*0.25*0.25)/250; y= 0.25/sqrt(250)*y + (0.03-0.5*0.25*0.25)/250
ShareA = 100*cumprod(exp(x))
ShareB = 100*cumprod(exp(y))
c = max(ShareA[n]-X*ShareB[n],0)
ctot=ctot+c
} # end of loop
c=ctot/nsim
c=c*exp(-0.03)
c
My question is: how can I turn this into a function in which I can change the correlation, r, to anything I like?
I hope this makes sense. Essentially, the issue is turning this code into a function.
Thanks; all suggestions appreciated.
To turn this call option into a function that accepts a correlation as an argument:
callOptionEval<-function(r=0.85){
X = 1.05; n = 250; nsim = 2000; ctot = 0
for( i in 1:nsim){ # beginning of loop
u1=rnorm(n,0,1)
u2=rnorm(n,0,1)
x=u1
y=r*u1+sqrt(1-r*r)*u2 # y is correlated with x through r
x=0.25/sqrt(250)*x + (0.03-0.5*0.25*0.25)/250
y=0.25/sqrt(250)*y + (0.03-0.5*0.25*0.25)/250
ShareA = 100*cumprod(exp(x))
ShareB = 100*cumprod(exp(y))
c = max(ShareA[n]-X*ShareB[n],0)
ctot=ctot+c
} # end of loop
c=ctot/nsim # average payoff over all simulations
c=c*exp(-0.03) # discount the average payoff
return(c)
}
callOptionEval(0.85) # Monte Carlo estimate; varies from run to run unless a seed is set
callOptionEval(0.5)
Note that this code doesn't do the useful things a function should, such as checking that the input lies between -1 and 1. It is more of a convenience function for the current user.
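A minimal sketch of what such an input check could look like. `call_option_checked` is a hypothetical name, and the sketch assumes that the hard-coded 250 in the original formulas is the number of steps n, and that 0.25 and 0.03 are the volatility and the interest rate.

```r
# Hypothetical variant of callOptionEval with a basic argument check.
# Assumption: the hard-coded 250 in the original is the number of steps n.
call_option_checked <- function(r = 0.85, X = 1.05, n = 250, nsim = 2000) {
  stopifnot(is.numeric(r), length(r) == 1, r >= -1, r <= 1)
  sigma <- 0.25  # assumed to be the volatility
  rate  <- 0.03  # assumed to be the interest rate
  drift <- (rate - 0.5 * sigma^2) / n
  payoff <- numeric(nsim)  # preallocate one payoff per simulation
  for (i in seq_len(nsim)) {
    u1 <- rnorm(n)
    u2 <- rnorm(n)
    x <- sigma / sqrt(n) * u1 + drift
    y <- sigma / sqrt(n) * (r * u1 + sqrt(1 - r^2) * u2) + drift
    # Only the final prices are used, and cumprod(exp(x))[n] equals exp(sum(x)).
    shareA <- 100 * exp(sum(x))
    shareB <- 100 * exp(sum(y))
    payoff[i] <- max(shareA - X * shareB, 0)
  }
  mean(payoff) * exp(-rate)  # discounted average payoff
}

set.seed(2024)  # only so repeated runs of the sketch agree
price <- call_option_checked(0.5)
price
```

Calling call_option_checked(2) stops with an error instead of silently producing NaN from sqrt(1 - r^2).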
mycorr <- function(X, r, n, nsim, ctot) {
# Your remaining lines and loop go here
}
Use it as
mycorr(X = 1.05, r = .85, n = 250, nsim = 2000, ctot = 0)