Faster solution to looped grouped RLE calculation - r

I have a working solution to my problem, but I will not be able to use it because it is so slow (my calculations predict that the whole simulation will take 2-3 years!). Thus I am looking for a better (faster) solution. This is (in essence) the code I am working with:
N=4
x <-NULL
for (i in 1:N) { #first loop
v <-sample(0:1, 1000000, 1/2) #generate data
v <-as.data.frame(v) #convert to dataframe
v$t <-rep(1:2, each=250) #group
v$p <-rep(1:2000, each=500) #p.number
# second loop
for (j in 1:2000) { #second loop
#count rle for group 1 for each pnumber
x <- rbind(x, table(rle(v$v[v$t==1&v$p==j])))
#count rle for group 2 for each pnumber
x <- rbind(x, table(rle(v$v[v$t==2&v$p==j])))
} #end second loop
} #end first loop
#total rle counts for both group 1 & 2
y <-aggregate(x, list(as.numeric(rownames(x))), sum)
In words: The code generates a coin-flip simulation (v). A group factor is generated (1 & 2). A p.number factor is generated (1:2000). The run lengths are recorded for each p.number (1:2000) for both groups 1 & group 2 (each p.number has runs in both groups). After N loops (the first loop), the total run lengths are presented as a table (aggregate) (that is, the run lengths for each group, for each p.number, over N loops as a total).
I need the first loop because the data that I am working with comes in individual files (so I'm loading the file, calculating various statistics etc and then loading the next file and doing the same). I am much less attached to the second loop, but can't figure out how to replace it with something faster.
What can be done to the second loop to make it (hopefully, a lot) faster?

You are committing the cardinal sin of growing an object within a for() loop in R. Don't (I repeat don't) do this. Allocate sufficient storage for x at the beginning and then fill in x as you go.
x <- matrix(nrow = N * (2000 * 2), ncol = ??)
Then in the inner loop
x[ii, ] <- table(rle(....))
where ii is a loop counter that you initialise to 1 before the first loop and increment within the second loop:
x <- matrix(nrow = N * (2000 * 2), ncol = ??)
ii <- 1
for(i in 1:N) {
.... # stuff here
for(j in 1:2000) {
.... # stuff here
x[ii, ] <- table(rle(....))
## increment ii
ii <- ii + 1
x[ii, ] <- table(rle(....))
## increment ii
ii <- ii + 1
} ## end inner loop
} ## end outer loop
Also note that you are reusing index i in bot for()loops which will not work.iis just a normal R object and so bothfor()loops will be overwriting it as the progress. USej` for the second loop as I did above.
Try that simple optimisation first and see if that will allow the real simulation to complete in an acceptable amount of time. If not, come back with a new Q showing the latest code and we can think about other optimisations. The optimisation above is simple to do, optimising table() and rle() might take a lot more work. Noting that, you might look at the tabulate() function which does the heavy lifting in table(), which might be one avenue for optimising that particular step.

If you just want to run rle and table for each combination of the values of v$t and v$p separately, there is no need for the second loop. It is much faster in this way:
values <- v$v + v$t * 10 + v$p * 100
runlength <- rle(values)
runlength$values <- runlength$values %% 2
x <- table(runlength)
y <- aggregate(unclass(x), list(as.numeric(rownames(x))), sum)
The whole code will look like this. If N is as low as 4, the growing object x will not be a severe problem. But generally I agree with #GavinSimpson, that it is not a good programming technique.
N=4
x <-NULL
for (i in 1:N) { #first loop
v <-sample(0:1, 1000000, 1/2) #generate data
v <-as.data.frame(v) #convert to dataframe
v$t <-rep(1:2, each=250) #group
v$p <-rep(1:2000, each=500) #p.number
values <- v$v + N * 10 + v$t * 100 + v$p * 1000
runlength <- rle(values)
runlength$values <- runlength$values %% 2
x <- rbind(x, table(runlength))
} #end first loop
y <-aggregate(x, list(as.numeric(rownames(x))), sum) #tota

Related

R - Sample function doesn't seem to be working in loop

So, I'm relatively new to R and have the following problem:
I want to run 1000 generations of a population of some organism. At each generation there is a certain probability to change from one environment to the other (there are just two different "environments").
Now, the code works just fine and I do get the desired results. However one small issue that still needs to be resolved is that for every run, the initial environment seems to be set at environment 1 even though I defined the initial environment to be randomly sampled (should be either environment 1 OR 2; you can find this in line 12 of the second block of code).
If anybody could help me resolve this issue, I would be very thankful.
simulate_one_gen_new <- function(K, N_total_init, N_wt, N_generalist, N_specialist, growth_wt, growth_generalist, growth_specialist, mut_rate) {
scaling <- min(K/(N_wt + N_generalist + N_specialist),1)
# draw offspring according to Poisson distribution
offsp_wt <- rpois(1, scaling * N_wt * growth_wt)
offsp_generalist <- rpois(1, scaling * N_generalist * growth_generalist)
offsp_specialist <- rpois(1, scaling * N_specialist * growth_specialist)
# draw new mutants according to Poisson distribution
mut_wt_to_generalist <- rpois(1, N_wt * (mut_rate/2))
mut_wt_to_specialist <- rpois(1, N_wt * (mut_rate/2))
# determine new population sizes of wild type and mutant
N_wt_new <- max(offsp_wt - mut_wt_to_specialist - mut_wt_to_generalist, 0)
N_generalist_new <- max(offsp_generalist + mut_wt_to_generalist,0)
N_specialist_new <- max(offsp_specialist + mut_wt_to_specialist,0)
N_total_new <- N_wt_new + N_generalist_new + N_specialist_new
return(c(N_total_new, N_wt_new, N_generalist_new, N_specialist_new))
}
# Test the function
print(simulate_one_gen_new(100,100,100,0,0,0.9,1.0,1.1,0.001))
The code block above is needed to simulate one single generation.
simulate_pop_new <- function(K, N_total_init,N_init_wt,
growth_vec1, growth_vec2, growth_vec3,
mut_rate, switch_prob) {
# determine that there are no mutants present at time 0
N_init_generalist <- 0
N_init_specialist <- 0
# Create the vector in which to save the results, including the index of the environment
pop_vector <- c(N_total_init,N_init_wt, N_init_generalist, N_init_specialist, 1)
# initiate the variables
pop_new <- c(N_total_init, N_init_wt, N_init_generalist, N_init_specialist)
# determine that the first environment is either 1 or 2
env_temp <- sample(1:2, size = 1, replace = T)
tmax <- 1000
j <- 0
# run the simulation until generation t_max
for (i in 1:tmax) {
# redefine the current population one generation later
pop_new <- c(simulate_one_gen_new(K,pop_new[1],pop_new[2],pop_new[3],pop_new[4], growth_vec1[env_temp],growth_vec2[env_temp], growth_vec3[env_temp],mut_rate),env_temp)
# add the new population sizes to the output vector
pop_vector <- rbind(pop_vector,pop_new)
# determine whether environmental switch occurs and make it happen
env_switch <- rbinom(1,1,switch_prob)
if (env_switch==1)
{
if(env_temp==1) env_temp <- 2
else env_temp <- 1
}
# condition to stop the simulation before t_max: either the population has only one of the two mutants left, or the whole population goes extinct
if ((pop_new[2] == 0 & pop_new[3] == 0) | (pop_new[2] == 0 & pop_new[4] == 0)){j=j+1}
if (j == 100) break #here we let it run 100 generations longer after the conditions above are met
}
# define the row and column names of the output vector
rownames(pop_vector) <- (0:length(pop_vector[1]))[1:length(pop_vector[,1])] # note that the vector has to be cut if the simulation stopped early
colnames(pop_vector) <- c("total","wt","generalist","specialist","env")
# return the result
return(pop_vector)
}
# Test the function and plot the result
# create your simulation data
output <- simulate_pop_new(1000,1000,1000,c(0.98,0.99),c(1.04,1.02),c(0.96,1.1),0.001,0.5)
# show the last few lines of the data table
print(tail(output))
# determine x axis range
x_range <- 0:(length(output[,1])-1)
# Create data frame from output (or just rename it)
df <- data.frame(output)
# Add a new column to our output that simply shows the Generations
df$Generation<-1:nrow(df)
# Manually create data frame where the genotypes are not separate but all in one column. Note that we need to repeat/ add together all other values since our "Genotype" column will be three times longer.
Genotype <- rep(c("wt", "generalist", "specialist"), each = length(output[,1]))
PopSize <- c(df$wt, df$generalist, df$specialist)
Generation <- rep(df$Generation, 3)
environment <- rep(df$env, 3)
# Let's also create a column solely for the total population
All_Genotypes <- df$generalist + df$wt + df$specialist
N_tot <- rep(All_Genotypes, 3)
# Create a new data frame containing the modified columns which we will be using for plotting
single_run <- data.frame(Generation, Genotype, PopSize, N_tot, environment)
print(tail(single_run))
Above is the second block of code which now simulates 1000 generations.

Converting Mahalanobis p1 probabilities to p2 probabilities - is vectorization possible in this context?

I'm trying to write a function that takes in p1 probabilities for Mahalanobis distances and returns p2 probabilities. The formula for p2, along with a worked example is given at on the IBM website. I have written a function (below) that solves the problem, and allows me to reproduce the p2 values given in the worked example at the aforementioned webpage.
p1_to_p2 <- function(p1,N) {
p2 <- numeric(length(p1))
for (i in 1:length(p1))
{
k <- i;
p1_value <- p1[i];
start_value <- 1;
while (k >= 1)
{
start_value = start_value - choose(N,N-k+1) * (1-p1_value)^(N-k+1) * (p1_value)^(k-1)
k <- k-1;
}
p2[i] <- start_value;
}
return(p2)
}
p1 <- c(.0046132,.0085718,.0390278,.0437704,.0475222)
N <- 73
p1_to_p2(p1,N)
Although the function works, it's been suggested to me by a colleague that it's inefficient/poorly written as it's not vectorized. This is indeed potentially relevant since in general we will be converting a lot more than just 5 p1 values to p2 values.
I have some limited experience vectorizing code, but I am wondering if a vectorized solution is possible in this context since within the loop the variable start_value constantly needs to update itself. If vectorization is not possible, is there some other way I should improve the code so that it works better?
Here is one way to do it, Breaking the steps here can help(Please read the comments):
#Input:
N <- 73
p1 <- c(.0046132,.0085718,.0390278,.0437704,.0475222)
n <- N:(N-length(p1)+1)
# code:
mahalanobis_dist = function(x=x,n){
m = max(n)
max_min = Reduce(`*`,c(1, n[-length(n)]), accumulate = TRUE)
acc = c(1, Reduce(`*`, seq_along(n), accumulate = TRUE)[-length(n)])
comns = max_min/acc
exp <- comns*((1 - x)**n)*(x**(m - n))
return(1- sum(exp))
} ## the calculation of Mahalanobis distances
## This is just an iterator for each of the sequences we have to run the above function
ls <- lapply(n, function(x)(max(n):x))
## creating a list of iterators
## applying mapply, mapply or Map can iterate multiple inputs of the function,
## here the input p1 and ls , p1 is your input points, ls is the iterator created above
mapply(mahalanobis_dist,p1, ls)
## Applying the function on each iterators
#Output:
#> mapply(mahalanobis_dist,p1, ls)
#[1] 0.2864785 0.1299047 0.5461263 0.3973690
#[5] 0.2662369
Note:
Also, one can join the last two steps like below, with one function and correct iteration this can be achieved:
mapply(mahalanobis_dist,p1, lapply(n, function(x)(max(n):x)))

Assigning value to dataframe in R - for loop speed

I have the following code:
n <- 1e6
no_clm <- rpois(n,30)
hold <- data.frame("x" = double(n))
c = 1
for (i in no_clm){
ctl <- sum(rgamma(i,30000)-2000)
hold[c,1] <- ctl
#hold <- rbind(hold,df)
c = c +1
}
Unfortunately the speed of this code is quite slow. I've narrowed down the speed to hold[c,1] <- ctl. If I remove this then the code runs near instantly.
How can I make this efficient? I need to store the results to some sort of dataframe or list in a fast fashion. In reality the actual code is more complex than this but the slowing point is the assigning.
Note that the above is just an example, in reality I have multiple calculations on the rgamma samples and each of these calculations are then stored in a large dataframe.
Try this
hold=data.frame(sapply(no_clm,function(x){
return(sum(rgamma(x,30000)-2000))
}))
It looks like you can just use one call to rgamma, as you are iterating over the number of observations parameter.
So if you do one call and the split the vector to the lengths required (no_clm) you can then just iterate over that list and sum
n <- 1e6
no_clm <- rpois(n, 30)
hold <- data.frame("x" = double(n))
# total observations to use for rgamma
total_clm <- sum(no_clm)
# get values
gammas <- rgamma(total_clm, 30000) - 2000
# split into list of lengths dictated by no_clm
hold$x <- sapply(split(gammas, cumsum(sequence(no_clm) == 1)), sum)
This took 5.919892 seconds
Move into sapply() loop instead of a for loop and then realise 2000 * no_clm can be moved outside the loop (to minimise number of function calls).
n <- 1e6
no_clm <- rpois(n, 30)
hold <- data.frame(x = sapply(no_clm, function(i) sum(rgamma(i, 30000))) - 2000 * no_clm)
You may observe a speed pickup using data.table:
dt = data.table(no_clm)
dt[, hold := sapply(no_clm, function(x) sum(rgamma(x, 30000)-2000))]

How do I turn a loop into a vector-based code?

I need to split a dataframe into N parts which shoud have a given length ("proglen").
I created a for-loop which gives me the desired result as a list of Dataframes.
Now I want to change the for-loop to a vector based code.
I'm not sure how to convert a loop into vector based code in R.
obslen = 600 ;proglen = 50 ; N = 50 ; testdat <- list()
for (i in 1:N){
test <- df[df$d >= df$d[obslen + i * proglen] &
df$d < df$d[proglen + obslen +i * proglen],]
testdat[[i]] <- test
}
The variable $d has the type Posixct.
The result should contain N Dataframes.
These dfs start at the day [obslen + i * proglen] and have the length "proglen".
This is almost similar to what you have done, just replaced by lapply and a little bit more readable. This will give you a list of data frames. This can also be done with recursive function but will be difficult to understand (in my view).
lapply(1:N, function(idx){
this_idx <- obslen + idx * proglen
next_idx <- obslen + (idx + 1) * proglen
df[this_idx:next_idx, ]
})
If the dates are distinct then you can always use the split function:
testdat <- split(df[obslen + seq(1, N * proglen), ], rep(seq(N), each = proglen))

Storing output of nested loop in R

I am new to R but trying desperately to learn the ropes. In fact I feel a little stupid asking this question as I have gone through a number of similar problems but have not been able to get the desired results. My code is as shown below :
## Initializing Parameters
fstart <- 960 ## Start frequency in MHz
fstop <- 1240 ## Stop Frequency In MHz
bw <- 5.44 ## IF Bandwidth in MHz
offset <- 100 ## Max. Variation in TOD in milliseconds
f_dwell <- 1 ## Time spent on each search frequency in millisecond
iterations <- 100 ## No. of iterations to run
## No. of possible frequencies
f <- seq((fstart + bw/2), (fstop - bw/2), by=bw)
## Initializing the frequency table
freq_table <- matrix (NA, nrow=(2*offset +1), ncol=offset)
## Fill frequency table row wise with random values of possible frequencies
for (i in 1:(2*offset + 1)){
row_value <- c(sample(f), sample(f, offset-length(f)))
freq_table[i, ] <- row_value
}
## Assign a row from freq_table to unknown node
unknown_node <- freq_table[sample(1:(2*offset + 1), 1), ]
t = numeric(iterations)
## Calculate number of repetitions of frequencies
for(k in 1:iterations){
for(j in 1:offset){
y <- (sort(table(freq_table[, j]), decreasing=TRUE))
x <- as.vector(y) ## Number of repetitions of each frequency
y <- names(y)
## Search Frequencies
sf1 <- as.numeric(y[1])
sf2 <- as.numeric(y[2])
if (unknown_node[j] == sf1){
t[k] <- ((j-1)*f_dwell)*2 + f_dwell
break
}
else {
if (unknown_node[j] == sf2){
t[k] <- ((j-1)*f_dwell)*2 + 2*f_dwell
break
}
}
## Delete rows from freq_table that have sf1 & sf2
freq_table <- subset(freq_table, freq_table[, 1]!=sf1 & freq_table[, 1]!=sf2 )
}
}
print(t)
If I run this without the k for loop, I get different values of variable t every time. However, I wanted to run the inner for loop iteratively and get a vector of t values each time the inner for loop runs. I do get the length of t as 100, but the values are repeating. The first few values (2 0r 3 or sometimes 4) are different, but the rest keep repeating. I can't figure out why.

Resources