R: C stack error when using recursive function - optimizing the code

I am trying to make a simple "model" of loss of an initial value based on a number of criteria, as seen in my code below:
mtDNAlen = 16299
copies = 10
mito = 50
fraction = copies*mtDNAlen*0.75
Rounds = 40
A = data.frame(
Length = c(rep(mtDNAlen,copies))
)
data = list()
for (i in 1:mito){
data[[i]]=list()
}
for (i in 1:length(data)){
for (j in 1:(Rounds+1)){
data[[i]][[j]]=data.frame(A)
}}
finaldata = data.frame(X = seq(1,copies,1))
random.sample = function(x) {
x = sample(sample.data, copies,
prob=Prob,
replace=FALSE)
if (sum(x) > fraction) return(x)
Recall(x)
}
for (i in 1:length(data)){
for(j in 1:Rounds){
data[[i]][[j]]$Deletion = sample(c("Yes","No"), nrow(data[[i]][[j]]), prob=c(0.05,0.95), replace=TRUE)
data[[i]][[j]]$DelLength = ifelse(data[[i]][[j]]$Deletion == "Yes", sample(seq(0,15000,1), replace = TRUE),0)
data[[i]][[j]]$Length = data[[i]][[j]]$Length - data[[i]][[j]]$DelLength
data[[i]][[j]] = data[[i]][[j]][data[[i]][[j]]$Length > 2000,]
data[[i]][[j]] = rbind(data[[i]][[j]],data[[i]][[j]])
Prob = c(rep(16300,nrow(data[[i]][[j]]))) - data[[i]][[j]]$Length
Prob = Prob / sum(Prob)
sample.data = c(data[[i]][[j]]$Length)
data[[i]][[j+1]]$Length = random.sample(sample.data)
finaldata[[i]] = data[[i]][[j+1]]$Length
}
}
However, when I try to run my code I get the following error:
Error: C stack usage 7969700 is too close to the limit
After searching, this seems to occur when using recursive functions. However, I am new to these types of functions, and I don't know how to optimise my code further in order to get rid of this error.
I should mention that when I run my code without prob=Prob in the random.sample function, there is no error. So I guess the error comes from having to redo the random.sample so many times? Can I do this in a better way to avoid the C stack error? I haven't been able to find an alternative myself.
Finally, if I set Rounds and mito to very small values, I can complete the calculation, but it is not really usable for me...
Thanks!
EDIT
I tried to switch to using repeat instead, making my for loop look like this:
for (i in 1:length(data)){
for(j in 1:Rounds){
data[[i]][[j]]$Deletion = sample(c("Yes","No"), nrow(data[[i]][[j]]), prob=c(0.05,0.95), replace=TRUE)
data[[i]][[j]]$DelLength = ifelse(data[[i]][[j]]$Deletion == "Yes", sample(seq(0,15000,1), replace = TRUE),0)
data[[i]][[j]]$Length = data[[i]][[j]]$Length - data[[i]][[j]]$DelLength
data[[i]][[j]] = data[[i]][[j]][data[[i]][[j]]$Length > 2000,]
data[[i]][[j]] = rbind(data[[i]][[j]],data[[i]][[j]])
Prob = c(rep(16300,nrow(data[[i]][[j]]))) - data[[i]][[j]]$Length
Prob = Prob / sum(Prob)
sample.data = c(data[[i]][[j]]$Length)
repeat {
v2 <- sample(sample.data, copies,
prob=Prob,
replace=FALSE)
if( sum(v2) > fraction )
break
}
return(v2)
data[[i]][[j+1]]$Length = v2
finaldata[[i]] = data[[i]][[j+1]]$Length
}
}
However, now I can't get the sampled data into the next data frame, i.e. the line data[[i]][[j+1]]$Length = v2 does not seem to run. I can see that v2 is generated, and it appears to have the appropriate form and data.
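One likely cause of the behaviour in the EDIT is that return() is only valid inside a function. Called at the top level of a for loop it raises an error, so the assignment on the following line never runs. Below is a minimal sketch of an iterative replacement for random.sample. The names sample_until and max_tries are hypothetical, and capping the number of attempts is an assumption added here so that a threshold that can no longer be reached fails loudly instead of looping forever.

```r
# Hypothetical helper: redraw until the sampled lengths sum above `threshold`.
# A loop replaces Recall(), so the call stack does not grow with each retry.
sample_until <- function(lengths, size, prob, threshold, max_tries = 10000) {
  for (attempt in seq_len(max_tries)) {
    draw <- sample(lengths, size, replace = FALSE, prob = prob)
    if (sum(draw) > threshold) return(draw)  # return() is valid here: we are inside a function
  }
  stop("no sample exceeded the threshold after ", max_tries, " tries")
}

set.seed(1)
lengths <- c(rep(16299, 15), rep(9000, 5))  # toy pool of 20 molecule lengths
prob <- 16300 - lengths                      # same weighting as in the question
prob <- prob / sum(prob)
threshold <- 10 * 16299 * 0.75               # same rule as `fraction` in the question
v2 <- sample_until(lengths, size = 10, prob = prob, threshold = threshold)
```

Inside the original loop, the call would presumably look like data[[i]][[j+1]]$Length = sample_until(sample.data, copies, Prob, fraction), with no return() afterwards.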

Related

Repeat iteration in a for loop in r

I am trying to generate a for loop that will repeat a sequence of the following:
sample(x = 1:14, size = 10, replace = TRUE, prob = c(1/4,1/4,1/4,1/4)
I want it to repeat 5000 times. So far, I include the above as the body of the loop and added
for (i in seq_along[1:5000]){
at the beginning but I am getting an error message saying
Error in seq_along[1:10000] : object of type 'builtin' is not subsettable
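We need replicate (with x = 1:4, since only four probabilities are supplied and sample requires one probability per element of x):
out <- replicate(5000, sample(x = 1:4, size = 10, replace = TRUE, prob = c(1/4,1/4,1/4,1/4)), simplify = FALSE)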
There are a few issues here.
@MartinGal noted the syntax issues with seq_along and the missing ). Note that you can use seq(n) or 1:n in defining the number of loops.
You are not storing the sampled vectors anywhere, so the for loop will run the code but you won't capture the output.
You have x = 1:14 but you only have 4 prob values, which suggests you intended x = 1:4 (either that or you are 10 prob values short).
Here's one way to address these issues using a for loop.
n <- 5
s <- 10
xmax <- 4
p <- 1/4
out <- matrix(nrow = n, ncol = s, byrow = TRUE)
set.seed(1L)
for (i in seq(n)) {
out[i, ] <- sample(x = seq(xmax), size = s, replace = TRUE, prob = rep(p, xmax))
}
As andrew reece notes in his comment, it looks like you want x = 1:4. Depending on what you want to do with your result, you could generate all of the realizations at once, since you are sampling with replacement, and then store the result in a matrix with 5000 rows of 10 realizations per row. So:
x <- sample(1:4, size = 5000 * 10, replace = TRUE, prob = c(1/4,1/4,1/4,1/4))
result <- matrix(x, nrow = 5000)

How to write down the number of the repeated data in R function

Suppose I have repeated simulated data (100 data sets), and I would like to apply one function to each of them. Since the data is repeated many times (sometimes 1000), I would like to know which data set my code is working on at any given moment. For example, when the code starts on the first data set, it should tell me that this is the first one, then the same for the second, and so on. I know I could get the index from the list printed in the console, but my real function is much more complicated; this is just a simple example to explain my problem.
This is my code:
N.a = 186; N.b = 38; N.ab=13; N.o = 284
## 1) numerical optimization
llk = function(xpar){
tmp = exp(c(xpar,0))
pr = tmp/sum(tmp) ## A/B/O
res1 = N.a*log(pr[1]^2+2*pr[1]*pr[3]) + N.b*log(pr[2]^2+2*pr[2]*pr[3])
res2 = N.ab*log(2*pr[1]*pr[2]) + N.o*log(pr[3]^2)
-res1-res2
}
pr = rep(1/3,3) ## A/B/O
it = 0; pdiff = 1
while( (it<100)&(pdiff>1e-5) ){
tmp = c(pr[1]^2, 2*pr[1]*pr[3])
tmp = tmp/sum(tmp)
N.aa = N.a*tmp[1]
N.ao = N.a*tmp[2]
tmp = c(pr[2]^2, 2*pr[2]*pr[3])
tmp = tmp/sum(tmp)
N.bb = N.b*tmp[1]
N.bo = N.b*tmp[2]
pr1 = c(2*N.aa+N.ao+N.ab, 2*N.bb+N.bo+N.ab, N.ao+N.bo+2*N.o)
pr1 = pr1/sum(pr1)
pdiff = mean(abs(pr1-pr))
it = it+1
pr = pr1
cat(it, pr, "\n")
}
How can I use the cat function for this? For example, how would I use this in my code:
cat(paste0("data: ", i, "\n"))
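A minimal sketch of how that cat() call could sit inside a loop over the data sets. The list datasets and the use of mean() are hypothetical stand-ins for the real simulated data and the real function.

```r
# Hypothetical stand-in for the real analysis: 5 simulated data sets.
set.seed(42)
datasets <- lapply(1:5, function(k) rnorm(20))

results <- numeric(length(datasets))       # pre-allocated result vector
for (i in seq_along(datasets)) {
  cat(paste0("data: ", i, "\n"))           # progress line, printed before the work starts
  results[i] <- mean(datasets[[i]])        # replace with the actual function
}
```

If the console buffers output, calling flush.console() right after cat() may help the message appear immediately.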

Optimise code for a simple monte carlo like simulation

I run the following code that works but just take ages and I'm sure there is a way to get the same results much faster.
library(dplyr)  # needed for %>%, group_by() and summarise() below
runs <- 1000
prediction <- runif(77,0,1)
n <- length(prediction)
df.all <- data.frame(Preds = rep(prediction, runs),
simno=rep(1:runs,each=n))
for (x in 1:runs) {
for (i in 1:length(df.all$Preds)){
df.all$rand[i] <- sample(1:100,1)
df.all$Win[i] <- ifelse(df.all$rand[i]<df.all$Preds[i]*100,1,0)
}
}
df.all %>% group_by(simno) %>% summarise(Wins=sum(Win)) -> output
This can easily be vectorised by:
Performing a single sample operation (note the additional replace = TRUE argument).
Performing a single comparison with <.
You can remove the inner for loop to get
for (x in 1:runs) {
df.all$rand = sample(1:100, size = nrow(df.all), replace=TRUE)
df.all$Win = df.all$rand < df.all$Preds*100
}
You can then take it one step further and remove that loop
df.all$rand = sample(1:100, size = nrow(df.all), replace=TRUE)
df.all$Win = df.all$rand < df.all$Preds*100
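Put together, the fully vectorised version might look like the sketch below. Using aggregate() from base R in place of the dplyr pipeline is a substitution made here to keep the example free of dependencies, and set.seed(123) is only there for reproducibility.

```r
set.seed(123)
runs <- 1000
prediction <- runif(77, 0, 1)
n <- length(prediction)

df.all <- data.frame(Preds = rep(prediction, runs),
                     simno = rep(1:runs, each = n))

# One random draw per row, then one vectorised comparison.
df.all$rand <- sample(1:100, size = nrow(df.all), replace = TRUE)
df.all$Win  <- as.integer(df.all$rand < df.all$Preds * 100)

# Wins per simulation, using base R instead of dplyr.
output <- aggregate(Win ~ simno, data = df.all, FUN = sum)
names(output)[2] <- "Wins"
```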

R - Dealing with zeros in randomized subsamples

I've run into a little problem simulating the throw of dice. Basically I'm doing this to get familiar with loops and their output.
Intention is to simulate the throw of two dice as follows:
R = 100
d6 = c(1:6)
d = 60
DICE = NULL
for (i in 1:R)
{
i <- as.factor((sample(d6, size=d, replace = T)) + (sample(d6, size=d, replace = T)))
j <- summary(i)
DICE = rbind(DICE, j)
}
head(DICE)
HIS = colMeans(DICE)
boxplot(DICE)
title(main= "Result 2d6", ylab= "Throws", xlab="")
relHIS = (HIS / sum(HIS))*100
relHIS
Problems occur if the result in one category is 0 (the result did not occur in the sample). If this happens randomly in the first subsample, one or more of the categories (numbers 2-12) are missing. This causes problems ("number of columns of result is not a multiple of vector length (arg 2)") in the following subsamples.
I'm sure there is a really simple solution for this, by defining everything beforehand...
Thanks for your help!
Here are some fixes:
R = 100
d6 = c(1:6)
d = 60
DICE = matrix(nrow = R, ncol = 11) #pre-allocate
colnames(DICE) <- 2:12
for (i in 1:R)
{
sim <- ordered((sample(d6, size=d, replace = T)) + (sample(d6, size=d, replace = T)),
levels = 2:12) #define the factor levels
sumsim <- table(sim)
DICE[i,] <- sumsim #sub-assign
}
head(DICE)
HIS = colMeans(DICE)
boxplot(DICE)
title(main= "Result 2d6", ylab= "Throws", xlab="")
prop.table(HIS) * 100
Always pre-allocate your result data structure. Growing it in a loop is terribly slow and you know how big it needs to be. Also, don't use the same symbol for the iteration variable and something else.
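A small illustration of why fixing the factor levels matters, using a made-up vector rolls. Without explicit levels, table() only reports values that actually occurred, so the number of columns can differ from one subsample to the next.

```r
rolls <- c(7, 7, 8, 3)                           # toy sample: most sums never occur

observed <- table(rolls)                         # only the 3 observed values appear
counts <- table(factor(rolls, levels = 2:12))    # all 11 categories, zeros included

length(observed)   # 3
length(counts)     # 11
```

Because counts always has 11 entries, it can be assigned to a row of a pre-allocated matrix with 11 columns on every iteration.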
Omit as.factor() in your seventh row.

How to avoid for loop in this dataset?

For loops in R are extremely slow, but I don't know of an alternative way to achieve the following.
As shown in this screenshot:
What I want the output format to look like:
gene_id tss_id x y
in which x = isosub$q1_FPKM / iso.agg$q1_FPKM (for the corresponding gene_id) and
y = isosub$q2_FPKM / iso.agg$q2_FPKM
Here is my code with the for loop:
length = length(isosub$gene_id)
tmp = data.frame(isosub$gene_id, isosub$q1_FPKM, isosub$q2_FPKM)
j = 1
denominator_q1 = iso.agg$q1_FPKM[j]
denominator_q2 = iso.agg$q2_FPKM[j]
gene_id = isosub$gene_id
tmpq1 = tmp$isosub.q1_FPKM
tmpq2 = tmp$isosub.q2_FPKM
isoq1 = iso.agg$q1_FPKM
isoq2 = iso.agg$q2_FPKM
o2_q1 = rep(0, length)
o2_q2 = rep(0, length)
i = 0
for (i in 1:length){
if (gene_id[i+1] == gene_id[i]){
o2_q1[i] = tmpq1[i] / denominator_q1
o2_q2[i] = tmpq2[i] / denominator_q2
}else{
o2_q1[i] = tmpq1[i] / denominator_q1
o2_q2[i] = tmpq2[i] / denominator_q2
j = j + 1
denominator_q1 = isoq1[j]
denominator_q2 = isoq2[j]
}
}
When length = 1000, system.time shows:
   user  system elapsed
  55.74    0.00   56.45
And my actual length is even larger: 13751.
Do you want to do a merge?
outdf <- merge(isosub[c("gene_id", "tss_id", "q1_FPKM", "q2_FPKM")],
iso.agg[c("gene_id", "q1_FPKM", "q2_FPKM")],
by="gene_id",
suffixes=c(".1", ".2"))
outdf$x <- outdf$q1_FPKM.1 / outdf$q1_FPKM.2
outdf$y <- outdf$q2_FPKM.1 / outdf$q2_FPKM.2
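To see the merge approach on something concrete, here is a sketch with toy data. The values in isosub and iso.agg below are made up for illustration; only the column names come from the question.

```r
# Toy stand-ins for the data frames in the question (values are made up).
isosub <- data.frame(gene_id = c("g1", "g1", "g2"),
                     tss_id  = c("t1", "t2", "t3"),
                     q1_FPKM = c(2, 6, 5),
                     q2_FPKM = c(1, 3, 10))
iso.agg <- data.frame(gene_id = c("g1", "g2"),
                      q1_FPKM = c(8, 5),
                      q2_FPKM = c(4, 10))

# Each isosub row picks up the aggregate values of its own gene_id.
outdf <- merge(isosub, iso.agg, by = "gene_id", suffixes = c(".1", ".2"))
outdf$x <- outdf$q1_FPKM.1 / outdf$q1_FPKM.2
outdf$y <- outdf$q2_FPKM.1 / outdf$q2_FPKM.2
outdf[c("gene_id", "tss_id", "x", "y")]
```

The division is then a single vectorised operation over all rows, so no explicit loop or running index into iso.agg is needed.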
If you ended up here looking for ways to avoid or speed up loops, check out this answer:
Speed up the loop operation in R
It helped me with a similar problem I was having, and shows ways to keep necessary loops but increase performance dramatically.
