Optimizing for loop in R

I have been doing a lot of research and I think I am missing something when it comes to nested for loops in R. I have two dataframes: one that contains the observations and the locations where I want to write the outputs, and another that has the variable names I am looping through. Right now the loop works, but it is taking 14+ hours to loop through 200 rows, which seems excessive. Granted, I am performing 12 separate sets of permutations (100 each) at each row, and I would ideally like to do 1000+ permutations. Is there a more efficient way of performing this for loop? When I run a single observation it completes in very little time (under 2 seconds), which makes me think there must be a better way to accomplish this task. Any help you can give in optimizing this code would be greatly appreciated. Thanks!
The main dataset (fbfm.xlsx), loaded as fm.std, is attached:
https://www.dropbox.com/s/vmd8d05yxds93j6/fbfm.xlsx?dl=0
library(rothermel)
u.val <- c(5, 10, 15, 25, 35, 45, 55, 65, 75, 85, 95, 100)
unames <- data.frame(u = u.val,
                     ros.nam = paste("u", u.val, "_ROS", sep = ""),
                     stringsAsFactors = FALSE)
ros.out <- data.frame(fm.std)
for (i in 1:dim(unames)[1]) {
  ros.out[, unames[i, 'ros.nam']] <- 999
}
ros.out <- as.vector(ros.out)
fm.std <- as.vector(fm.std)
for (i in 1:dim(ros.out)[1]) {
  for (u in 1:dim(unames)[1]) {
    ros.out[i, unames[u, 'ros.nam']] <- mean(rosunc(
      modeltype = fm.std[i, 'Fuel_Model_Type'],  # dynamic or static model
      w = fm.std[i, 4:8],                   # fuel loads (1, 10, 100, herb, and shrub)
      s = fm.std[i, 9:13],                  # SAV measurements
      delta = fm.std[i, 14],                # fuel bed depth
      mx.dead = fm.std[i, 15],              # dead fuel moisture of extinction
      h = fm.std[i, 16:20],                 # heat content for fuel classes
      m = fm.std[i, c(25, 24, 23, 26, 30)], # percent moisture of fuel classes
      u = unames[u, 'u'],
      slope = 0,                            # wind and slope of 0
      sdm = 0.3,
      nsim = 100))
  }
}

Consider a more vectorized sapply() approach: an outer sapply() over u.val and an inner one over 1:nrow(fm.std). This builds a 200-row, 12-column matrix that you can convert to a dataframe and then cbind to the original dataframe.
ucols <- sapply(u.val, function(x) {
  sapply(1:nrow(fm.std), function(y) {
    mean(rosunc(modeltype = fm.std[y, 'Fuel_Model_Type'],  # dynamic or static model
                w = fm.std[y, 4:8],                   # fuel loads (1, 10, 100, herb, and shrub)
                s = fm.std[y, 9:13],                  # SAV measurements
                delta = fm.std[y, 14],                # fuel bed depth
                mx.dead = fm.std[y, 15],              # dead fuel moisture of extinction
                h = fm.std[y, 16:20],                 # heat content for fuel classes
                m = fm.std[y, c(25, 24, 23, 26, 30)], # percent moisture of fuel classes
                u = x,
                slope = 0,
                sdm = 0.3,
                nsim = 100))
  })
})
# CONVERT MATRIX TO DATA FRAME
ucols <- data.frame(ucols)
# RENAME COLUMNS
names(ucols) <- paste("u", u.val, "_ROS", sep = "")
# BIND COLUMNS TO ORIGINAL DATA FRAME
ros.out <- cbind(fm.std, ucols)
Alternatively, consider using outer() with a transpose, t(), to build the same 200-row, 12-column matrix. Note that outer() calls its function once with vectorized arguments, so the scalar rosunc() call needs a Vectorize() wrapper:
ucols <- t(outer(u.val, 1:nrow(fm.std),
                 Vectorize(function(x, y) {
                   mean(rosunc(...))  # same arguments as in the sapply version
                 })))
...
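If most of the time is spent inside rosunc() itself, restructuring alone will only buy so much. Since every (row, wind speed) combination is independent, the work also parallelizes naturally. Here is a minimal sketch using the parallel package (mclapply() forks on Unix-alikes; on Windows, parLapply() is the substitute), where ros_row() is a hypothetical helper wrapping the same rosunc() call as above:
library(parallel)
# ros_row(): hypothetical wrapper computing all 12 wind-speed means for row i
ros_row <- function(i) {
  sapply(u.val, function(u) {
    mean(rosunc(modeltype = fm.std[i, 'Fuel_Model_Type'],
                w = fm.std[i, 4:8], s = fm.std[i, 9:13],
                delta = fm.std[i, 14], mx.dead = fm.std[i, 15],
                h = fm.std[i, 16:20], m = fm.std[i, c(25, 24, 23, 26, 30)],
                u = u, slope = 0, sdm = 0.3, nsim = 100))
  })
}
# one task per row, spread over 4 cores; bind the per-row vectors back together
rows <- mclapply(1:nrow(fm.std), ros_row, mc.cores = 4)
ucols <- do.call(rbind, rows)
colnames(ucols) <- paste("u", u.val, "_ROS", sep = "")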

Related

R: Efficiently Calculate Deviations from the Mean Using Row Operations on a DF (Without Using a For Loop)

I am generating a very large data frame consisting of a large number of combinations of values. As such, my code has to be as efficient as possible, or else 1) I get errors like "R cannot allocate vector of size XX" or 2) the calculations take forever.
I am at the point where I need to calculate r deviations from the mean for each sample (one sample per row of the df; in the example below r = 3, with the deviations labeled dev1 through dev3).
I tried this (r is the number of values in each sample, here set to 3):
X2<-apply(X1[,1:r],1,function(x) x-X1$x.bar)
When I try this, the result is not what I expect. I am guessing that this code is attempting to calculate the difference between each row of X1 (x) and the entire vector X1$x.bar, instead of 81 for the 1st row, 81.25 for the 2nd row, etc.
Once again, I can easily do this using for loops, but I'm assuming that is not the most efficient way.
Can someone please steer me in the right direction? Any assistance is appreciated.
Here is the whole code for the small sample version with r <- 3. WARNING: This computes all possible combinations, so the dfs get very large very quickly.
options(scipen = 999)
dp <- function(x) {
  dp1 <- nchar(sapply(strsplit(sub('0+$', '', as.character(format(x, scientific = FALSE))),
                               ".", fixed = TRUE),
                      function(x) x[2]))
  ifelse(is.na(dp1), 0, dp1)
}
retain1 <- function(x, minuni) length(unique(floor(x))) >= minuni
# =======================================================
r <- 3
x0 <- seq(80, 120, .25)
X0 <- data.frame(t(combn(x0, r)))
names(X0) <- paste("x", 1:r, sep = "")
X <- X0[apply(X0, 1, retain1, minuni = r), ]
rm(X0)
gc()
X$x.bar <- rowMeans(X)
dp1 <- dp(X$x.bar)
X1 <- X[dp1 <= 2, ]
rm(X)
gc()
X2 <- apply(X1[, 1:r], 1, function(x) x - X1$x.bar)
Because R is vectorized, you only need to subtract x.bar from x1, x2, x3 collectively:
devs <- X1[ , 1:3] - X1[ , 4]
X1devs <- cbind(X1, devs)
That's it...
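If you prefer to parameterize by r instead of hard-coding the column positions, sweep() expresses the same row-wise subtraction:
# subtract x.bar from the first r columns, row by row
devs <- sweep(X1[, 1:r], 1, X1$x.bar, "-")
X1devs <- cbind(X1, devs)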
I think you just got the margin wrong: in apply you're using 1, as in row-wise, but you want to go column-wise, so use 2:
X2<-apply(X1[,1:r], 2, function(x) x-X1$x.bar)
But from what I quickly searched, the apply family isn't better in performance than loops, only in clarity. Check this post: Is R's apply family more than syntactic sugar?
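For this task the fully vectorized subtraction above should beat both. A quick way to compare on your own data (timings will vary by machine):
# vectorized: one subtraction over the whole block
system.time(devs_vec <- X1[, 1:r] - X1$x.bar)
# column-wise apply with the corrected margin
system.time(devs_apply <- apply(X1[, 1:r], 2, function(x) x - X1$x.bar))
# the two results should agree
all.equal(as.matrix(devs_vec), devs_apply, check.attributes = FALSE)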

How to create a coin-tossing function in R

I'm trying to create a function in R to simulate the experiment of tossing four coins m times, where each experiment records which face ("numbers" or "images", i.e., tails or heads) appears on each coin.
The results of the m experiments should be presented in tabular form, with the count of coins showing the "numbers" face added as the last column of the table.
Sim_Coin <- function(m) {
  c1 <- c()
  c2 <- c()
  cs <- c()
  for (i in 1:m) {
    c1 <- rbind(d1, sample(0:1, size = 1))
    c2 <- rbind(d2, sample(0:1, size = 1))
  }
  cs <- c1 + c2
  v <- cbind(c1, c2, cs)
  v <- as.data.frame(v)
  names(v) <- c("coin1", "coin2", "sum")
  return(v)
}
But it fails, and I don't know how to create the table.
R is a vectorized language, so in many cases the need for a loop can be avoided. Instead of looping m times, just draw m samples at once. This will greatly improve performance.
Also, progressively growing a vector or data frame with a bind function inside a loop is slow in R, since a new copy of the object is created on each call.
Take a look at this streamlined code:
Sim_Coin <- function(m) {
  coin1 <- sample(c("head", "tail"), size = m, replace = TRUE)
  coin2 <- sample(c("head", "tail"), size = m, replace = TRUE)
  v <- data.frame(coin1, coin2)
  v$sum <- apply(v, 1, function(i) { sum(i == "head") })
  return(v)
}
Sim_Coin(3)
coin1 coin2 sum
1 tail tail 0
2 head head 2
3 tail head 1
Since your question talked about flipping 4 coins and not just 2, here is an expanded version:
Sim_Coin2 <- function(m) {
  n <- 4  # number of coins to flip
  # create n vectors, each m long
  coins <- lapply(1:n, function(i) {
    sample(0:1, size = m, replace = TRUE)
  })
  # make a data frame and rename the columns
  dfcoin <- as.data.frame(do.call(cbind, coins))
  names(dfcoin) <- paste0("Coin", 1:n)
  # calculate the number of heads by taking the sum of each row
  dfcoin$sum <- rowSums(dfcoin)
  dfcoin
}
Sim_Coin2(10)

Implementing fast numerical calculations in R

I was trying to do an extensive computation in R. Eighteen hours have passed and RStudio still seems to be working. I'm not sure if I could have written the script differently to make it faster. I was trying to implement a Crank-Nicolson type method over a 50000 by 350 matrix, as shown below:
# defining the discretization of cells
dt <- 1
t <- 50000
dz <- 0.0075
z <- 350 * dz
# velocity & diffusion
v <- 2 / (24 * 60 * 60)
D <- 0.02475 / (24 * 60 * 60)
# make the big matrix (all filled with zeros)
m <- as.data.frame(matrix(0, t/dt + 1, z/dz + 2))  # extra columns/rows for boundary conditions
# fill the first and last columns with constant boundary values
m[, 1] <- 400
m[, length(m)] <- 0
# implement the calculation
for (j in 2:(length(m[1, ]) - 1)) {
  for (i in 2:length(m[[1]])) {
    m[i, ][2:length(m) - 1][[j]] <- m[i-1, ][[j]] +
      D*dt*(m[i-1, ][[j+1]] - 2*m[i-1, ][[j]] + m[i-1, ][[j-1]])/(dz^2) -
      v*dt*(m[i-1, ][[j+1]] - m[i-1, ][[j-1]])/(2*dz)
  }
}
Is there a way to know how long it would take R to finish? Is there a better way of constructing the numerical calculation? At this point, I feel like Excel would have been faster!
Just making a few simple optimisations really helps here. The original version of your code would take ~5 days on my laptop. By using a matrix rather than a data frame, and computing values that are reused in the loop only once, we bring this down to around 7 minutes.
And think about messy constructions like
m[i,][2:length(m)-1][[j]]
This is equivalent to
m[[i, j]]
which would be faster (as well as much easier to understand). Making this change further reduces the runtime by another factor of over 2, to around 3 minutes.
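A toy data frame standing in for m makes the equivalence easy to check:
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
df[2, ][2:length(df) - 1][[2]]  # the messy form: returns 5
df[[2, 2]]                      # the direct form: also 5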
Putting this together we have
dt <- 1
t <- 50000
dz <- 0.0075
z <- 350 * dz
# velocity & diffusion
v <- 2 / (24 * 60 * 60)
D <- 0.02475 / (24 * 60 * 60)
# make the big matrix (all filled with zeros)
m <- matrix(0, t/dt + 1, z/dz + 2)  # extra columns/rows for boundary conditions
# cache a few values that get reused many times
NC <- NCOL(m)
NR <- NROW(m)
C1 <- D * dt / dz^2
C2 <- v * dt / (2 * dz)
# fill the first and last columns with constant boundary values
m[, 1] <- 400
m[, NC] <- 0
# implement the calculation
for (j in 2:(NC - 1)) {
  for (i in 2:NR) {
    ma   <- m[i-1, ]
    ma.1 <- ma[[j+1]]
    ma.2 <- ma[[j-1]]
    m[[i, j]] <- ma[[j]] + C1 * (ma.1 - 2 * ma[[j]] + ma.2) - C2 * (ma.1 - ma.2)
  }
}
If you need to go even faster than this, you can try out some more optimisations. For example, see here for how different ways of indexing the same element can have very different execution times. In general, it is better to refer to the column first, then the row.
If all the optimisations you can do in R are not enough for your speed requirements, then you might implement the loop in Rcpp instead.
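For illustration, a minimal Rcpp sketch of the same loop via cppFunction(), reusing the cached constants C1 and C2 from above and keeping the original traversal order (a starting point, not tested against the real data):
library(Rcpp)
cppFunction('
NumericMatrix stepScheme(NumericMatrix m, double C1, double C2) {
  int NR = m.nrow(), NC = m.ncol();
  // same traversal as the R version: columns outer, rows inner
  for (int j = 1; j < NC - 1; ++j) {
    for (int i = 1; i < NR; ++i) {
      double left  = m(i - 1, j - 1);
      double mid   = m(i - 1, j);
      double right = m(i - 1, j + 1);
      m(i, j) = mid + C1 * (right - 2 * mid + left) - C2 * (right - left);
    }
  }
  return m;
}')
m <- stepScheme(m, C1, C2)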

Using by in parallel in R

I am a noob R programmer. I have written code that needs to apply a function to a data frame split by factors. The data frame contains about 1,324,961 observations, with 64,376 levels in the factor variable that we use to slice the dataframe.
The code is as follows:
library("readstata13")
# Reading the Stata Data file into R
bod_fb <- read.dta13("BoD_nonmissing_fb.dta")
gen_fuzzy_blau <- function(bod_sample){
# Here we drop the Variables that are not required in creating the Fuzzy-Blau index
bod_sample <- as.data.frame(bod_sample)
bod_sample$tot_occur <- as.numeric(bod_sample$tot_occur)
bod_sample$caste1_occ <- as.numeric(bod_sample$caste1_occ)
bod_sample$caste2_occ <- as.numeric(bod_sample$caste2_occ)
bod_sample$caste3_occ <- as.numeric(bod_sample$caste3_occ)
bod_sample$caste4_occ <- as.numeric(bod_sample$caste4_occ)
# Calculating the probabilities of a director belonging to a caste
bod_sample$caste1_occ <- (bod_sample$caste1_occ)/(bod_sample$tot_occur)
bod_sample$caste2_occ <- (bod_sample$caste2_occ)/(bod_sample$tot_occur)
bod_sample$caste3_occ <- (bod_sample$caste3_occ)/(bod_sample$tot_occur)
bod_sample$caste4_occ <- (bod_sample$caste4_occ)/(bod_sample$tot_occur)
# Dropping the Total Occurrences column, as we do not need it anymore
bod_sample$tot_occur<- NULL
# Here we replace all the blanks with NA
bod_sample <- apply(bod_sample, 2, function(x) gsub("^$|^ $", NA, x))
bod_sample <- as.data.frame(bod_sample)
# Here we push all the NAs in the caste names and caste probabilities to the end of the row
# So if there are only two castes against a name, then they become caste1 and caste2
caste_list<-data.frame(bod_sample$caste1,bod_sample$caste2,bod_sample$caste3,bod_sample$caste4)
caste_list = as.data.frame(t(apply(caste_list,1, function(x) { return(c(x[!is.na(x)],x[is.na(x)]) )} )))
caste_list_prob<-data.frame(bod_sample$caste1_occ,bod_sample$caste2_occ,bod_sample$caste3_occ,bod_sample$caste4_occ)
caste_list_prob = as.data.frame(t(apply(caste_list_prob,1, function(x) { return(c(x[!is.na(x)],x[is.na(x)]) )} )))
# Here we write two functions: 1. gen_castelist
#                              2. gen_caste_prob
# gen_castelist:  takes the row number (serial number of the director)
#                 and returns the names of all the castes for which he has
#                 a non-zero probability.
# gen_caste_prob: takes the row number (serial number of the director)
#                 and returns the probabilities with which he belongs to
#                 those castes.
#
gen_castelist <- function(x){
y <- caste_list[x,]
y <- as.vector(y[!is.na(y)])
return(y)
}
gen_caste_prob <- function(x){
z <- caste_list_prob[x,]
z <- z[!is.na(z)]
z <- as.numeric(z)
return(z)
}
caste_ls <-list()
caste_prob_ls <- list()
for(i in 1:nrow(bod_sample))
{
caste_ls[[i]]<- gen_castelist(i)
caste_prob_ls[[i]]<- gen_caste_prob(i)
}
gridcaste <- expand.grid(caste_ls)
gridcaste <- data.frame(lapply(gridcaste, as.character), stringsAsFactors=FALSE)
gridcasteprob <- expand.grid(caste_prob_ls)
# Generating the Joint Probability
gridcasteprob$JP <- apply(gridcasteprob,1,prod)
# Generating the Similarity Index
gen_sim_index <- function(x){
x <- t(x)
a <- as.data.frame(table(x))
sim_index <- sum(a$Freq^2)/(sum(a$Freq))^2
return(sim_index)
}
gridcaste$sim_index <- apply(gridcaste,1,gen_sim_index)
# Generating fuzzyblau
gridcaste$fb <- gridcaste$sim_index * gridcasteprob$JP
fuzzy_blau_index <- sum(gridcaste$fb)
remove_list <- c("gridcaste","")
return(fuzzy_blau_index)
}
fuzzy_blau_output <- by(bod_fb,bod_fb$code_year,gen_fuzzy_blau)
# Saving the output as a dataframe with two columns
# Column 1 is the fuzzy blau index
# Column 2 is the code_year
code_year <- names(fuzzy_blau_output)
fuzzy_blau <- as.data.frame(as.vector(unlist(fuzzy_blau_output)))
names(fuzzy_blau) <- c("fuzzy_blau_index")
fuzzy_blau$code_year <- code_year
bod_fb <- merge(bod_fb,fuzzy_blau,by = "code_year")
save.dta13(bod_fb,"bod_fb_example.dta")
If the code is tl;dr, the summary is as follows:
I have a dataframe bod_fb. I need to apply the gen_fuzzy_blau function to this dataframe, slicing it by the factors of bod_fb$code_year.
Since the function is very heavy, sequential processing takes more than a day and ends up running out of memory. The function gen_fuzzy_blau returns a numeric variable fuzzy_blau_index for each code_year of the dataframe. I use by to apply the function to each slice. I wanted to know if there is a way to run this code in parallel, so that multiple instances of the function run at once on different slices of the dataframe. I did not find a by equivalent in the parallel package, and I did not know how to pass the dataframes as iterators when using the foreach and doParallel packages.
I have an AMD A8 laptop with 4GB RAM and Windows 7 SP1 Home Basic. I have set 20GB of page file memory (this was after I got the memory error).
Thank you
EDIT 1: @milkmotel I have eliminated the redundancy in the code and removed the for loops, but a huge amount of time is being spent in gen_sim_index; I am using the proc.time() function to gauge the time that each part of the code takes.
The function is supposed to do the following to a row:
If we have a row (not a vector), say a a b c, the similarity index will be (2/4)^2 + (1/4)^2 + (1/4)^2, i.e., the sum over unique elements of (number of occurrences of the element / total number of elements in the row)^2.
I am unable to use the apply function directly on the row because each element in the row has different factor levels, so table() does not output the frequencies properly.
What is an efficient way to code the gen_sim_index function?
You're saving your data 6 times over in 6 different variables. Try not doing that.
And it takes a day because you're running character indexing on a ridiculous amount of data with gsub().
Take your code out of the gen_fuzzy_blau function: wrapping it all up in one function provides no value over running it independently. Then run it one line at a time; if a line takes too long, reconsider your method. Your code is incredibly inefficient.
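That said, once the per-slice work has been slimmed down, a by() can be parallelized by splitting first and then distributing the pieces. A minimal sketch with the parallel package, assuming gen_fuzzy_blau depends only on its argument:
library(parallel)
cl <- makeCluster(detectCores() - 1)
pieces <- split(bod_fb, bod_fb$code_year)      # the same slicing by() performs
out <- parLapply(cl, pieces, gen_fuzzy_blau)   # one result per code_year
stopCluster(cl)
fuzzy_blau <- data.frame(code_year = names(out),
                         fuzzy_blau_index = unlist(out),
                         stringsAsFactors = FALSE)
And for the gen_sim_index bottleneck raised in the edit, one sketch that sidesteps the factor-level issue is to coerce the row to character before tabulating:
gen_sim_index <- function(x) {
  counts <- tabulate(factor(as.character(x)))  # frequency of each unique label in the row
  sum(counts^2) / sum(counts)^2
}
gen_sim_index(c("a", "a", "b", "c"))  # (2/4)^2 + (1/4)^2 + (1/4)^2 = 0.375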

repeated random samples from subsets within data

I am an R novice and have been running into a brick wall with what should be a simple for loop process. The data consist of a list with dimensions 81161 by 9. The observations are of individuals over time. The current need is to isolate each unique group of observations and randomly extract one of its data points. At this stage I have reviewed and attempted a few options, none of which execute properly: first for loops, then apply.
To provide a better idea, I have outlined the workflow. This should be a relatively straightforward split-apply-combine: the apply step is a sample restricted to unique individual_days. To do this, the code first defines the overall dimensions, then the unique values, then a sort and rank (from which the unique individual_days are set on an ordinal scale and linked back to the original data using individual_day as the key). From this point I have attempted two alternatives with for loops: first using a split by rank to provide DSrank$'1, 2, 3...n' (attempted in example 2), and second using the subset seen in example 1. A single sample would then be extracted at random from each group and collated into a sub-dataset, on which further analysis will be performed.
### example 1: for loop
SDS <- list()
for (i in 1:length(UID)) {
  SDS[[i]] <- sample(nrow(SplitDS[[i]]), 1, replace = FALSE)
  SDS[[i]]["Samples"] <- i
}
head(SDS)
### example 2: for loop
SDS <- list()
for (i in 1:length(UID)) {
  SubSDID <- subset(DSID, DSrank == 'i')
  SDS <- sample(nrow(SubSDID), 1, replace = FALSE)
}
head(SDS)
### example 3: apply subset
bootstrap <- lapply(1:length(UID), function(i) {
  samp <- sample(1:nrow(DSID$DSrank), 1, rep = TRUE)
  DSID$DSrank[samp, ]
})
These have been based on examples I found through CRAN, Stack Overflow, and other R code search results.
If you have any suggestions, tips, or tricks that you could share it would be greatly appreciated.
MB
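For what it's worth, the sampling step itself can be written as a short split-apply-combine; a minimal sketch, assuming the data frame is DSID and the grouping key is a column named individual_day (names taken from the question):
# one randomly chosen row per unique individual_day
pieces <- split(DSID, DSID$individual_day)
sampled <- lapply(pieces, function(d) d[sample(nrow(d), 1), ])
SDS <- do.call(rbind, sampled)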
