I am relatively new to R and all its wisdom, and I am trying to make my script more efficient. I am using a loop to simulate how an animal moves among different sites. The problem is that when I increase the number of sites or change the initial parameters (based on a fixed probability of moving or staying in the same site), I end up with a very complicated loop. Since I have to run several simulations with different parameters, I would prefer a more efficient loop, or a function that could adjust to different situations. The first loop fills a matrix according to the initial probabilities, and the second loop compares the cumulative probability matrix against a random number drawn from a list of values (10 in this example) and decides the fate of that individual (either stay or go to a new site).
Here is a simplification of my code:
N<-4 # number of sites
sites<-LETTERS[seq(from=1,to=N)]
p.stay<-0.45
p.move<-0.4
move<-matrix(c(0),nrow=N,ncol=N,dimnames=list(c(sites),c(sites)))
from<-array(0,c(N,N),dimnames=list(c(sites),c(sites)))
to<-array(0,c(N,N),dimnames=list(c(sites),c(sites)))
# Filling matrix with fixed probability #
for(from in 1:N){
  for(to in 1:N){
    if(from==to){move[from,to]<-p.stay} else {move[from,to]<-p.move/(N-1)}
  }
}
move
cumsum.move<-cumsum(data.frame(move))
steps<-100
result<-as.character("") # for storing results
rand<-sample(random,steps,replace=TRUE) # 'random' is the vector of 10 candidate values mentioned above (not defined in this snippet)
time.step<-data.frame(rand)
colnames(time.step)<-c("time.step")
time.step$event<-""
to.r<-(rbind(sites))
j<-sample(1:N,1,replace=T) # first column to select (random number)
k<-sample(1:N,1,replace=T) # site selected after leaving and coming back
# Beginning of the longer loop #
for(i in 1:steps){
  if (time.step$time.step[i]<cumsum.move[1,j]){time.step$event[i]<-to.r[1]} else
  if (time.step$time.step[i]<cumsum.move[2,j]){time.step$event[i]<-to.r[2]} else
  if (time.step$time.step[i]<cumsum.move[3,j]){time.step$event[i]<-to.r[3]} else
  if (time.step$time.step[i]<cumsum.move[4,j]){time.step$event[i]<-to.r[4]} else
  if (time.step$time.step[i]<(0.95)){time.step$event[i]<-NA} else
  if (time.step$time.step[i]<1.0) break # break the loop
  result[i]<-time.step$event[i]
  j<-which(to.r==result[i])
  if(length(j)==0){j<-k} # for individuals that leave and come back later
}
time.step
result
This loop is part of a bigger loop that simulates and stores the results of a series of simulations. Any ideas or comments on how I can improve the efficiency of this loop, so that I can easily modify the number of sites or the initial probability parameters without repeating or making major edits to the loop, would be appreciated.
I'm not sure if I'm capturing the essence of your code, but this is faster than the for loops. The advantage appears as soon as you get past a few thousand steps. I replaced "random" with a sample from the uniform distribution (runif()).
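For reference, the setup I assume for the timings below is roughly this (a sketch; runif() stands in for the undefined random vector, and steps is bumped to 10,000):
steps <- 10000
rand <- runif(steps)                        # uniform draws instead of the undefined 'random'
time.step <- data.frame(time.step = rand)
time.step$event <- ""
j <- sample(1:N, 1)                         # starting column, as in the question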
system.time(
time.step$event <- sapply(
time.step$time.step,
function(x) rownames(
cumsum.move[which(cumsum.move[,j] > x),])[[1]]
)
)
Here are my results for 10,000 steps. I'm working on a laptop, so 100,000 steps with the for loop didn't finish in under 1 minute, but sapply did it in 14 seconds.
> system.time(
+ time.step$event <- sapply(
+ time.step$time.step,
+ function(x) rownames(
+ cumsum.move[which(cumsum.move[,j] > x),])[[1]]
+ )
+ )
user system elapsed
1.384 0.000 1.387
> head(time.step)
time.step event
1 0.2787642 C
2 0.3098240 C
3 0.9079045 D
4 0.9904031 D
5 0.3754330 C
6 0.6984415 C
> system.time(
+ for(i in 1:steps){
+ if (time.step$time.step[i]<cumsum.move[1,j]){time.step$event[i]<-to.r[1]} else
+ if (time.step$time.step[i]<cumsum.move[2,j]){time.step$event[i]<-to.r[2]} else
+ if (time.step$time.step[i]<cumsum.move[3,j]){time.step$event[i]<-to.r[3]} else
+ if (time.step$time.step[i]<cumsum.move[4,j]){time.step$event[i]<-to.r[4]}
+ result[i]<-time.step$event[i]
+ }
+ )
user system elapsed
3.137 0.000 3.143
> head(time.step)
time.step event
1 0.2787642 C
2 0.3098240 C
3 0.9079045 D
4 0.9904031 D
5 0.3754330 C
6 0.6984415 C
So, I'm relatively new to R and have the following problem:
I want to run 1000 generations of a population of some organism. At each generation there is a certain probability to change from one environment to the other (there are just two different "environments").
Now, the code works just fine and I do get the desired results. However, one small issue still needs to be resolved: for every run, the initial environment seems to be set to environment 1, even though I defined the initial environment to be randomly sampled (it should be either environment 1 OR 2; you can find this in line 12 of the second block of code).
If anybody could help me resolve this issue, I would be very thankful.
simulate_one_gen_new <- function(K, N_total_init, N_wt, N_generalist, N_specialist, growth_wt, growth_generalist, growth_specialist, mut_rate) {
  scaling <- min(K/(N_wt + N_generalist + N_specialist), 1)
  # draw offspring according to Poisson distribution
  offsp_wt <- rpois(1, scaling * N_wt * growth_wt)
  offsp_generalist <- rpois(1, scaling * N_generalist * growth_generalist)
  offsp_specialist <- rpois(1, scaling * N_specialist * growth_specialist)
  # draw new mutants according to Poisson distribution
  mut_wt_to_generalist <- rpois(1, N_wt * (mut_rate/2))
  mut_wt_to_specialist <- rpois(1, N_wt * (mut_rate/2))
  # determine new population sizes of wild type and mutant
  N_wt_new <- max(offsp_wt - mut_wt_to_specialist - mut_wt_to_generalist, 0)
  N_generalist_new <- max(offsp_generalist + mut_wt_to_generalist, 0)
  N_specialist_new <- max(offsp_specialist + mut_wt_to_specialist, 0)
  N_total_new <- N_wt_new + N_generalist_new + N_specialist_new
  return(c(N_total_new, N_wt_new, N_generalist_new, N_specialist_new))
}
# Test the function
print(simulate_one_gen_new(100,100,100,0,0,0.9,1.0,1.1,0.001))
The code block above is needed to simulate one single generation.
simulate_pop_new <- function(K, N_total_init, N_init_wt,
                             growth_vec1, growth_vec2, growth_vec3,
                             mut_rate, switch_prob) {
  # determine that there are no mutants present at time 0
  N_init_generalist <- 0
  N_init_specialist <- 0
  # Create the vector in which to save the results, including the index of the environment
  pop_vector <- c(N_total_init, N_init_wt, N_init_generalist, N_init_specialist, 1)
  # initiate the variables
  pop_new <- c(N_total_init, N_init_wt, N_init_generalist, N_init_specialist)
  # determine that the first environment is either 1 or 2
  env_temp <- sample(1:2, size = 1, replace = T)
  tmax <- 1000
  j <- 0
  # run the simulation until generation t_max
  for (i in 1:tmax) {
    # redefine the current population one generation later
    pop_new <- c(simulate_one_gen_new(K,pop_new[1],pop_new[2],pop_new[3],pop_new[4], growth_vec1[env_temp],growth_vec2[env_temp], growth_vec3[env_temp],mut_rate),env_temp)
    # add the new population sizes to the output vector
    pop_vector <- rbind(pop_vector,pop_new)
    # determine whether environmental switch occurs and make it happen
    env_switch <- rbinom(1,1,switch_prob)
    if (env_switch==1)
    {
      if(env_temp==1) env_temp <- 2
      else env_temp <- 1
    }
    # condition to stop the simulation before t_max: either the population has only one of the two mutants left, or the whole population goes extinct
    if ((pop_new[2] == 0 & pop_new[3] == 0) | (pop_new[2] == 0 & pop_new[4] == 0)){j=j+1}
    if (j == 100) break # here we let it run 100 generations longer after the conditions above are met
  }
  # define the row and column names of the output vector
  rownames(pop_vector) <- (0:length(pop_vector[1]))[1:length(pop_vector[,1])] # note that the vector has to be cut if the simulation stopped early
  colnames(pop_vector) <- c("total","wt","generalist","specialist","env")
  # return the result
  return(pop_vector)
}
# Test the function and plot the result
# create your simulation data
output <- simulate_pop_new(1000,1000,1000,c(0.98,0.99),c(1.04,1.02),c(0.96,1.1),0.001,0.5)
# show the last few lines of the data table
print(tail(output))
# determine x axis range
x_range <- 0:(length(output[,1])-1)
# Create data frame from output (or just rename it)
df <- data.frame(output)
# Add a new column to our output that simply shows the Generations
df$Generation<-1:nrow(df)
# Manually create data frame where the genotypes are not separate but all in one column. Note that we need to repeat/ add together all other values since our "Genotype" column will be three times longer.
Genotype <- rep(c("wt", "generalist", "specialist"), each = length(output[,1]))
PopSize <- c(df$wt, df$generalist, df$specialist)
Generation <- rep(df$Generation, 3)
environment <- rep(df$env, 3)
# Let's also create a column solely for the total population
All_Genotypes <- df$generalist + df$wt + df$specialist
N_tot <- rep(All_Genotypes, 3)
# Create a new data frame containing the modified columns which we will be using for plotting
single_run <- data.frame(Generation, Genotype, PopSize, N_tot, environment)
print(tail(single_run))
Above is the second block of code which now simulates 1000 generations.
I am trying to gather some bootstrapped estimates for summary statistics from a dataset, but I want to resample parts of the dataset at different rates, which has led me to lean on nested for loops.
Specifically, suppose there are two groups in my dataset, and each group is further divided into test and control. Group 1 has a 75% / 25% test-control ratio, and Group 2 has a 50% / 50% test-control ratio.
I want to resample such that the dataset is the same size, but the test-control ratios are 90% / 10% for both groups... in other words, resample different subgroups at different rates, which strikes me as different from what the boot package normally does.
In my dataset, I created a group variable representing the groups, and a groupT variable representing group concatenated with test/control, e.g.:
id group groupT
1 1 1T
2 1 1T
3 2 2T
4 1 1C
5 2 2C
Here's what I am running right now, with nreps arbitrarily set to be my number of bootstrap replications:
for (j in 1:nreps){
  bootdat <- datafile[-(1:nrow(datafile)),] ## initialize empty dataset
  for (i in unique(datafile$groups)){
    tstring <- paste0(i,"T") ## e.g. 1T
    cstring <- paste0(i,"C") ## e.g. 1C
    ## Size of test group resample should be ~90% of total group size
    tsize <- round(.90*length(which(datafile$groups==i)),0)
    ## Size of control group resample should be total group size minus test group size
    csize <- length(which(datafile$groups==i))-tsize
    ## Continue building bootdat by rbinding the test and control resample
    ## before moving on to the next group
    ## Note the use of datafile$groupT==tstring to ensure I'm only sampling from test, etc.
    bootdat <- rbind(bootdat, datafile[sample(which(datafile$groupT==tstring), size=tsize,
                                              replace=TRUE),])
    bootdat <- rbind(bootdat, datafile[sample(which(datafile$groupT==cstring), size=csize,
                                              replace=TRUE),])
  }
  ## Here, there is code to grab some summary statistics from bootdat
  ## and store them in statVector[j] before moving on to the next replication
}
With a dataset size of about 1 million total records, this takes 3-4 minutes per replication. I feel certain there is a better way to do this either with sapply or possibly some of the dplyr functions, but I have come up empty in my attempts so far. Any help would be appreciated!
I'd strongly encourage you to look into data.table and foreach, using keyed searches for the bootstraps. That will let you do a single bootstrap very rapidly, and you can run each bootstrap independently on a different core. Each bootstrap in the example below takes about 0.5 seconds on my machine, searching through a table of 1 million rows. Something like the following should get you started:
library(data.table)
library(foreach)
library(doMC)
registerDoMC(cores=4)
# example data
dat <- data.table(id=1:1e6, group=sample(2, size=1e6, replace=TRUE), test_control=sample(c("T","C"), size=1e5, replace=TRUE))
# define number of bootstraps
nBootstraps <- 1000
# define sampling fractions
fraction_test <- 0.90
fraction_control <- 1 - fraction_test
# get number that you want to sample from each group
N.test <- round(fraction_test * dim(dat)[1])
N.control <- round(fraction_control * dim(dat)[1])
# key data by id
setkey(dat, id)
# get ID values for each combination, to be used for keyed search during bootstrapping
group1_test_ids <- dat[group==1 & test_control=="T"]$id
group1_control_ids <- dat[group==1 & test_control=="C"]$id
group2_test_ids <- dat[group==2 & test_control=="T"]$id
group2_control_ids <- dat[group==2 & test_control=="C"]$id
results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %dopar% {
  # sample each group with the defined sizes, with replacement
  g1T <- dat[.(sample(group1_test_ids, size=N.test, replace=TRUE))]
  g1C <- dat[.(sample(group1_control_ids, size=N.control, replace=TRUE))]
  g2T <- dat[.(sample(group2_test_ids, size=N.test, replace=TRUE))]
  g2C <- dat[.(sample(group2_control_ids, size=N.control, replace=TRUE))]
  dat.all <- rbindlist(list(g1T, g1C, g2T, g2C))
  dat.all[, bootstrap := n]
  # do summary stats here with dat.all, return the summary stats data.table object
  return(dat.summarized)
}
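As an illustration only (the original leaves the summary step open and dat.summarized is a placeholder), the summary step could look something like this; the choice of statistic is an assumption:
# hypothetical summary: count resampled rows per group/test_control cell for this replicate
dat.summarized <- dat.all[, .(n_rows = .N), by = .(group, test_control, bootstrap)]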
EDIT: the example below builds a lookup table for an arbitrary number of unique groups. The IDs corresponding to each combination of group + (test or control) can then be referenced within a foreach loop for simplicity. With lower numbers for N.test and N.control (900 and 100), it spits out the results of 1000 bootstraps in roughly a minute (timing shown further down).
library(data.table)
library(foreach)
# example data
dat <- data.table(id=1:1e6, group=sample(24, size=1e6, replace=TRUE), test_control=sample(c("T","C"), size=1e5, replace=TRUE))
# save vector of all group values & change group to character vector for hashed environment lookup
all_groups <- as.character(sort(unique(dat$group)))
dat[, group := as.character(group)]
# define number of bootstraps
nBootstraps <- 1000
# get number that you want to sample from each group
N.test <- 900
N.control <- 100
# key data by id
setkey(dat, id)
# all values for group
# Set up lookup table for every combination of group + test/control
control.ids <- new.env()
test.ids <- new.env()
for(i in all_groups) {
  control.ids[[i]] <- dat[group==i & test_control=="C"]$id
  test.ids[[i]] <- dat[group==i & test_control=="T"]$id
}
results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %do% {
  foreach(group.i = all_groups, .combine="rbind") %do% {
    # get IDs that correspond to this group, for both test and control
    control_id_vector <- control.ids[[group.i]]
    test_id_vector <- test.ids[[group.i]]
    # search and bind
    controls <- dat[.(sample(control_id_vector, size=N.control, replace=TRUE))]
    tests <- dat[.(sample(test_id_vector, size=N.test, replace=TRUE))]
    dat.group <- rbindlist(list(controls, tests))
    dat.group[, bootstrap := n]
    return(dat.group[])
  }
  # summarize across all groups for this bootstrap and return summary stat data.table object
}
yielding
> results
id group test_control bootstrap
1: 701570 1 C 1
2: 424018 1 C 1
3: 909932 1 C 1
4: 15354 1 C 1
5: 514882 1 C 1
---
23999996: 898651 24 T 1000
23999997: 482374 24 T 1000
23999998: 845577 24 T 1000
23999999: 862359 24 T 1000
24000000: 602078 24 T 1000
This doesn't include any of the summary stat calculation time, but here 1000 bootstraps were pulled out on 1 core serially in:
user system elapsed
62.574 1.267 63.844
If you need N to be different for each group, you can do the same thing as with the id lookup:
# create environments
control.Ns <- new.env()
test.Ns <- new.env()
# assign size values
control.Ns[["1"]] <- 900
test.Ns[["1"]] <- 100
control.Ns[["2"]] <- 400
test.Ns[["2"]] <- 50
... ...
control.Ns[["24"]] <- 200
test.Ns[["24"]] <- 5
then change the big bootstrap loop to look up these values based on the loop's current group:
results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %do% {
  foreach(group.i = all_groups, .combine="rbind") %do% {
    # get IDs that correspond to this group, for both test and control
    control_id_vector <- control.ids[[group.i]]
    test_id_vector <- test.ids[[group.i]]
    # get size values
    N.control <- control.Ns[[group.i]]
    N.test <- test.Ns[[group.i]]
    # search and bind
    controls <- dat[.(sample(control_id_vector, size=N.control, replace=TRUE))]
    tests <- dat[.(sample(test_id_vector, size=N.test, replace=TRUE))]
    dat.group <- rbindlist(list(controls, tests))
    dat.group[, bootstrap := n]
    return(dat.group[])
  }
  # summarize across all groups for this bootstrap and return summary stat data.table object
}
Just like caw5cv, I recommend taking a look at data.table; it is usually very efficient at solving such problems. However, if you would rather work with dplyr, you can try something like this:
library(dplyr) # for the %>% pipe used below
summary_of_boot_data <- lapply(1:nreps, function(y){
  # get bootdata
  bootdata <- lapply(unique(datafile$group), function(x){
    tstring <- paste0(x,"T")
    cstring <- paste0(x,"C")
    tsize <- round(.90*length(which(datafile$group==x)),0)
    csize <- length(which(datafile$group==x))-tsize
    df <- rbind(datafile[sample(which(datafile$groupT==tstring),
                                size=tsize,
                                replace=TRUE),],
                datafile[sample(which(datafile$groupT==cstring),
                                size=csize,
                                replace=TRUE),])
    return(df)
  }) %>% do.call(rbind, .)
  # return your summary thing for bootdata e.g. summary(bootdata)
  summary(bootdata)
})
summary_of_boot_data
I tried not to change your code much; I just replaced the use of for with lapply.
Hope this helps.
EDIT: Based on the comment from Hugh you might want to try using data.table::rbindlist() instead of do.call(rbind, .)
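As a toy illustration of that swap (not the questioner's data), rbindlist() accepts the same list of data.frames that do.call(rbind, .) would:
library(data.table)
pieces <- lapply(1:3, function(i) data.frame(group = i, value = rnorm(2))) # stand-in for the per-group resamples
rbindlist(pieces) # same rows as do.call(rbind, pieces), typically faster for many pieces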
I have the following data.frame and dictionaries with pos/negWords:
sent <- data.frame(words = c("just right size and i love this notebook", "benefits great laptop",
"wouldnt bad notebook", "very good quality", "orgtop",
"great improvement", "notebook is not good but i love batterytop"), user = c(1,2,3,4,5,6,7),
stringsAsFactors=F)
posWords <- c("great","improvement","love","great improvement","very good","good","right","very","benefits",
"extra","benefit","top","extraordinarily","extraordinary","super","benefits super")
negWords <- c("hate","bad","not good","horrible")
And the following function, which matches the words in each sentence against the pos/negWords dictionaries and computes a sentiment value according to the frequency of occurrence - but it is an exact match.
# descending order for words length (prepare data for function below)
wordsDF <- data.frame(words = posWords, value = 1,stringsAsFactors=F)
wordsDF <- rbind(wordsDF,data.frame(words = negWords, value = -1))
wordsDF$lengths <- unlist(lapply(wordsDF$words, nchar))
wordsDF <- wordsDF[order(-wordsDF[,3]),]
rownames(wordsDF) <- NULL
scoreSentence <- function(sentence){
  score <- 0
  for(x in 1:nrow(wordsDF)){
    match <- paste("\\<",wordsDF[x,1],'\\>', sep="") # matching exact words
    count <- length(grep(match,sentence)) # count them
    if(count){
      score <- score + (count * wordsDF[x,2]) # compute score (count * sentValue)
      sentence <- gsub(paste0('\\s*\\b', wordsDF[x,1], '\\b\\s*', collapse='|'), '', sentence) # remove words which were matched
    }
  }
  score # return the final score
}
which generates the desired output when called like this:
SentimentScore <- unlist(lapply(sent$words, scoreSentence))
bbb <- cbind(sent, SentimentScore)
This results in the desired output:
words user SentimentScore
1 just right size and i love this notebook 1 2
2 benefits great laptop 2 2
3 wouldnt bad notebook 3 -1
4 very good quality 4 1
5 orgtop 5 0
6 great improvement 6 1
7 notebook is not good but i love batterytop 7 0
For those purposes a for loop was used, but I have 7,000 pos/negWords and 200,000 sentences, so it is never-ending...
Please, do you have a better solution for this task, mainly one that gives the same result in SentimentScore? :-)
I'll appreciate any advice or solution. Many thanks in advance.
First, you should run on sub-pieces of your data.frame, because list resizing during lapply probably generates a huge overhead:
ptm = proc.time(); f=lapply(1:100000, function(X){X}); print(proc.time()-ptm)
user system elapsed
0.056 0.004 0.061
ptm = proc.time(); f=lapply(1:1000000, function(X){X}); print(proc.time()-ptm)
user system elapsed
1.112 0.004 1.119
Here a factor of 10 in sequence size yields roughly a factor of 20 in computation time. So use small lists, then concatenate them into one big list.
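For instance, a minimal sketch of the chunk-then-concatenate idea (a toy example, unrelated to the sentiment data):
idx_chunks <- split(1:1000000, ceiling((1:1000000) / 100000)) # ten chunks of 100,000 indices
sub_lists <- lapply(idx_chunks, function(idx) lapply(idx, function(X){X})) # work on each chunk
f <- unlist(sub_lists, recursive = FALSE) # concatenate back into one big list of 1,000,000 elements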
Declaring a big data.frame does not take long compared to extending one, so declare it up front and then fill it from your sub-lists:
bbb = data.frame( words=sent[1], user=sent[2], scoreSentence=rep(0, nrow(sent)) )
MAX_SIZE = 10000
for ( ii in 0:(ceiling(nrow(sent)/MAX_SIZE)-1) ) {
  selected_rows = (1 + ii * MAX_SIZE):min( (ii+1)*MAX_SIZE, nrow(sent) )
  bbb[selected_rows, "scoreSentence"] = unlist(lapply(sent$words[selected_rows], scoreSentence))
}
MAX_SIZE must be big enough, because a for loop is slower than lapply (you want as few iterations of the for loop as possible), but not too big, or the list-extension overhead will make the program slower.
Alternative with parallelisation
Parallelisation is a good way to speed up a set of substantial calculations by running each of them on a different core. In your case, each calculation is made substantial by sending a big chunk of sentences to each worker.
With mclapply from the parallel package, you send each chunk to a different forked process; each call stays fast because the chunk is not too big. A wrapper for scoreSentence that handles a vector of row indices is required:
library(parallel) # for mclapply
# wrapper for scoreSentence that handles a vector of row indices (must be defined before it is used below)
scoreSentenceWrapper <- function(selected_rows) {
  return(unlist(lapply(sent$words[selected_rows], scoreSentence)))
}
bbb = data.frame( words=sent[1], user=sent[2], scoreSentence=rep(0, nrow(sent)) )
MAX_SIZE = 10000
mc_list = list()
mc_list[[ceiling(nrow(sent)/MAX_SIZE)]] = 0 # pre-size the list of index chunks
for ( ii in 0:(ceiling(nrow(sent)/MAX_SIZE)-1) ) {
  mc_list[[ii+1]] = (1 + ii * MAX_SIZE):min( (ii+1)*MAX_SIZE, nrow(sent) )
}
bbb[,"scoreSentence"] = unlist(mclapply(mc_list, scoreSentenceWrapper))
I have a working solution to my problem, but I will not be able to use it because it is so slow (my calculations predict that the whole simulation will take 2-3 years!). Thus I am looking for a better (faster) solution. This is (in essence) the code I am working with:
N <- 4
x <- NULL
for (i in 1:N) { # first loop
  v <- sample(0:1, 1000000, 1/2) # generate data
  v <- as.data.frame(v) # convert to dataframe
  v$t <- rep(1:2, each=250) # group
  v$p <- rep(1:2000, each=500) # p.number
  # second loop
  for (j in 1:2000) { # second loop
    # count rle for group 1 for each pnumber
    x <- rbind(x, table(rle(v$v[v$t==1 & v$p==j])))
    # count rle for group 2 for each pnumber
    x <- rbind(x, table(rle(v$v[v$t==2 & v$p==j])))
  } # end second loop
} # end first loop
# total rle counts for both group 1 & 2
y <- aggregate(x, list(as.numeric(rownames(x))), sum)
In words: the code generates a coin-flip simulation (v). A group factor is generated (1 & 2). A p.number factor is generated (1:2000). The run lengths are recorded for each p.number (1:2000) for both group 1 and group 2 (each p.number has runs in both groups). After N loops (the first loop), the total run lengths are presented as a table (aggregate), that is, the run lengths for each group, for each p.number, summed over the N loops.
I need the first loop because the data that I am working with comes in individual files (so I'm loading the file, calculating various statistics etc and then loading the next file and doing the same). I am much less attached to the second loop, but can't figure out how to replace it with something faster.
What can be done to the second loop to make it (hopefully, a lot) faster?
You are committing the cardinal sin of growing an object within a for() loop in R. Don't (I repeat don't) do this. Allocate sufficient storage for x at the beginning and then fill in x as you go.
x <- matrix(nrow = N * (2000 * 2), ncol = ??)
Then in the inner loop
x[ii, ] <- table(rle(....))
where ii is a loop counter that you initialise to 1 before the first loop and increment within the second loop:
x <- matrix(nrow = N * (2000 * 2), ncol = ??)
ii <- 1
for(i in 1:N) {
.... # stuff here
for(j in 1:2000) {
.... # stuff here
x[ii, ] <- table(rle(....))
## increment ii
ii <- ii + 1
x[ii, ] <- table(rle(....))
## increment ii
ii <- ii + 1
} ## end inner loop
} ## end outer loop
Also note that you are reusing the index i in both for() loops, which will not work. i is just a normal R object, so both for() loops will overwrite it as they progress. Use j for the second loop, as I did above.
Try that simple optimisation first and see if it allows the real simulation to complete in an acceptable amount of time. If not, come back with a new question showing the latest code and we can think about other optimisations. The optimisation above is simple to do; optimising table() and rle() might take a lot more work. That said, you might look at the tabulate() function, which does the heavy lifting inside table() and might be one avenue for optimising that particular step.
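For instance, a minimal sketch of that idea for one group/p.number combination (this assumes the run lengths themselves are what you want to count, as table(rle(...)) does above):
r <- rle(v$v[v$t == 1 & v$p == j]) # runs for one group/p.number combination
counts0 <- tabulate(r$lengths[r$values == 0]) # counts0[k]: number of runs of 0s of length k
counts1 <- tabulate(r$lengths[r$values == 1]) # counts1[k]: number of runs of 1s of length k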
If you just want to run rle and table for each combination of the values of v$t and v$p separately, there is no need for the second loop. It is much faster this way: because the encoded values change at every t/p boundary, the runs break automatically at those boundaries, and taking the run values modulo 2 recovers the original 0/1 outcomes.
values <- v$v + v$t * 10 + v$p * 100
runlength <- rle(values)
runlength$values <- runlength$values %% 2
x <- table(runlength)
y <- aggregate(unclass(x), list(as.numeric(rownames(x))), sum)
The whole code will look like this. If N is as low as 4, the growing object x will not be a severe problem, but generally I agree with @GavinSimpson that it is not a good programming technique.
N <- 4
x <- NULL
for (i in 1:N) { # first loop
  v <- sample(0:1, 1000000, 1/2) # generate data
  v <- as.data.frame(v) # convert to dataframe
  v$t <- rep(1:2, each=250) # group
  v$p <- rep(1:2000, each=500) # p.number
  values <- v$v + N * 10 + v$t * 100 + v$p * 1000
  runlength <- rle(values)
  runlength$values <- runlength$values %% 2
  x <- rbind(x, table(runlength))
} # end first loop
y <- aggregate(x, list(as.numeric(rownames(x))), sum) # total rle counts for both group 1 & 2
I've got a column in a CSV file that looks like c("","1","1 1e-3") (i.e. whitespace separated). I'm trying to run through all values, taking the sum() of the values where there is at least one value and returning NA otherwise.
My code currently does something like this:
x <- c("","1","1 2 3")
x2 <- as.numeric(rep(NA,length(x)))
for (i in 1:length(x)) {
  si <- scan(text=x[[i]], quiet=TRUE)
  if (length(si) > 0)
    x2[[i]] <- sum(si)
}
I'm struggling to make this fast; x is really a set of columns from a CSV file containing a few hundred thousand rows, and I thought it should be possible to do this quickly in R.
(these are thinned samples from the posterior of a reversible jump MCMC algorithm, hence combining multiple values as the dimensionality changes throughout the file and I want useful columns).
Building on the idea from @Chase, but handling NA and also avoiding a name for the helper function:
unlist(lapply(strsplit(x, " "),
function(v)
if (length(v) > 0)
sum(as.numeric(v))
else
NA
) )
This seems to perform a bit faster and may work for you.
#define a helper function
f <- function(x) sum(as.numeric(x))
unlist(lapply(strsplit(x, " "), f))
#-----
[1] 0 1 6
This will return a zero instead of NA, but maybe that isn't a deal breaker for you?
Let's see how this scales to a larger problem:
#set up variables
x3 <- rep(x, 1e5)
x4 <- as.numeric(rep(NA,length(x3)))
#initial approach
system.time(for (i in 1:length(x3)) {
si <- scan(text=x3[[i]],quiet=TRUE)
if (length(si) > 0)
x4[[i]] <- sum(si)
})
#-----
user system elapsed
30.5 0.0 30.5
#New approach:
system.time(unlist(lapply((strsplit(x3, " ")), f)))
#-----
user system elapsed
0.82 0.01 0.84