Say I have the following code, which essentially gives me random simulations of revenue and cost for 12 months:
simulate.revenue <- function() {
  return(sapply(rnorm(12, 100000, 30000), function(x) max(0, x)))
}
simulate.cost <- function() {
  return(sapply(rnorm(12, 50000, 20000), function(x) max(0, x)))
}
sim.run <- function() {
  revenue <- simulate.revenue()
  cost <- simulate.cost()
  profit <- revenue - cost
  year.simulation <- data.frame(revenue, cost, profit)
  return(year.simulation)
}
Now to run the above simulation function 10 times I am aware that I should:
sim.results<-replicate(10,sim.run())
So the question is how do I further process sim.results to say:
find the mean for total yearly profit over each run
find the mean for profit by month over each of the runs (mean(profit[1]), mean(profit[2]), ...)
Structure of replicate result:
replicate(1, sim.run()) easily gives you the structure of what is returned: A list item for each column of the data.frame (here 3 list items). Running two simulations adds another 3 list items.
Convert it into proper format:
To convert the list into a data.frame use:
result <- data.frame(matrix(unlist(sim.results), nrow = 12, byrow = FALSE))
In your case every 3 columns of the resulting data.frame correspond to one simulation. To separate the simulations into a list again:
result_list <- list()
m <- 1
n_simulations <- 10
n_columnsPerSimulation <- 3
for (i in seq(1, n_simulations * n_columnsPerSimulation, n_columnsPerSimulation)) {
  result_list[[m]] <- result[, seq(i, i + n_columnsPerSimulation - 1)]
  m <- m + 1
}
This is very ugly but seems to work.
Analyze result:
Now you can analyze each simulation e.g. with sapply/lapply like the following example shows:
sapply(result_list, function(x) mean(x[,1]))
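A possibly simpler route, sketched here as an alternative rather than as the method above: ask replicate not to simplify, so each run stays a data.frame, and then pull out the profit column. The set.seed call, the pmax shortcut and the names runs, profit_by_run, yearly_profit, mean_yearly_profit and mean_monthly_profit are assumptions added for illustration.

```r
set.seed(42)  # assumption: fixed seed so the example is reproducible

# Same simulation as in the question, written with pmax() instead of sapply()
simulate.revenue <- function() pmax(0, rnorm(12, 100000, 30000))
simulate.cost <- function() pmax(0, rnorm(12, 50000, 20000))
sim.run <- function() {
  revenue <- simulate.revenue()
  cost <- simulate.cost()
  data.frame(revenue, cost, profit = revenue - cost)
}

# simplify = FALSE keeps one data.frame per run in a plain list
runs <- replicate(10, sim.run(), simplify = FALSE)

# 12 x 10 matrix: one row per month, one column per run
profit_by_run <- sapply(runs, function(run) run$profit)

yearly_profit <- colSums(profit_by_run)         # total profit of each run
mean_yearly_profit <- mean(yearly_profit)       # mean yearly profit over the runs
mean_monthly_profit <- rowMeans(profit_by_run)  # mean profit per month over the runs
```

With the profits in one matrix, both requested summaries are a single call each, and no column bookkeeping is needed.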
Related
I'm trying to run multiple iterations of a function using a different subset of my dataframe each time. In reality the function takes a very long time, so I want to split the iterations across multiple cores using mclapply. For each iteration I'm using sample to randomly select a subset of the dataframe, and this is inside the function I have written to give to mclapply. However, the results of each of the iterations in the output list are identical, suggesting that mclapply is not re-running the sample lines each time. This must be something to do with how I have written the code. Any ideas where I have gone wrong?
Here is a reproducible example of a small dataset that runs quickly. You will notice that the 10 iterations in the d.val.all output list are identical, which is not what I am after.
library(bipartite)
library(doBy)
library(parallel)
# create dummy data
ecto.matrix1=data.frame(replicate(10,sample(0:80,81,rep=TRUE)),Species.mix.90=c(sample(c("R","M","S","B"),81,rep=TRUE)))
# set up the function
funct.resample.d <- function(i) {
RedSites <- row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90=="R"]
MountainSites <- row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90=="M"]
randomSilverSites <- sample(row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90=="S"],8,replace=F)
randomBlackSites <- sample(row.names(ecto.matrix1)[ecto.matrix1$Species.mix.90=="B"],8,replace=F)
resampledSites <- c(RedSites,MountainSites,randomSilverSites,randomBlackSites) # make vector of the site names
matrix=ecto.matrix1[resampledSites,] # select only those rows from the resampled row names
matrix1 = matrix[,colSums(matrix[,-c(ncol(matrix))]) > 0] # drop cols that sum to 0
matrix2=summaryBy(matrix1[,-c(ncol(matrix1))]~Species.mix.90,data=matrix1,FUN=sum)
for (col in 1:ncol(matrix2)){
colnames(matrix2)[col] <- sub(".sum", "", colnames(matrix2)[col]) # remove the sum bit from the col names
}
row.names(matrix2)<-matrix2$Species.mix.90 # make row names
matrix2=subset(matrix2, select=-c(Species.mix.90)) # drop host col
d.val <- dfun(matrix2)$dprime
}
# run mclapply
reps=c(1:10)
d.val.all <- mclapply(reps, funct.resample.d, mc.cores = 10)
In case anyone else is having similar issues, I figured out that the problem was with the summaryBy function rather than sample. I replaced summaryBy with aggregate, and the randomization worked fine.
matrix2=aggregate(. ~ Species.mix.90, matrix1, sum)
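For anyone unsure what that aggregate call does, here is a minimal sketch on made-up data. The data frame toy and its values are assumptions for illustration; only the column name Species.mix.90 comes from the question.

```r
# Toy stand-in for matrix1: two count columns plus the grouping column
toy <- data.frame(
  X1 = c(1, 2, 3, 4),
  X2 = c(0, 5, 0, 5),
  Species.mix.90 = c("R", "R", "M", "M")
)

# Sum every other column within each level of Species.mix.90
agg <- aggregate(. ~ Species.mix.90, data = toy, FUN = sum)
agg
```

The result has one row per host group, and the summed columns keep their original names, so the later step that strips ".sum" from the column names is no longer needed.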
I'm trying to create a function in R that simulates tossing four coins m times, where each experiment records whether "numbers" or "images" appears on each coin.
Present the results of the m experiments in tabular form, and add the number of "numbers" sides that appear as the last column of the table.
Sim_Coin<-function(m){
c1<-c()
c2<-c()
cs<-c()
for(i in 1:m)
{
c1<-rbind(d1,sample(0:1,size=1)
c2<-rbind(d2,sample(0:1,size=1)
}
cs<-c1+c2
v<-cbind(c1,c2,cs)
v<-as.data.frame(v)
names(v)<-c("coin1","coin2","sum")
return(v)
}
But it fails and I don't know how to create the table
R is a vectorized language, so in many cases a loop can be avoided. Instead of looping m times, just draw m samples from 0 or 1 in a single call. This will greatly improve performance.
Also, progressively growing a vector or data frame with rbind or cbind inside a loop is slow in R, since a new copy of the data is created with each call.
Take a look at this streamlined code:
Sim_Coin <- function(m) {
  coin1 <- sample(c("head", "tail"), size = m, replace = TRUE)
  coin2 <- sample(c("head", "tail"), size = m, replace = TRUE)
  v <- data.frame(coin1, coin2)
  v$sum <- apply(v, 1, function(i) {sum(i == "head")})
  return(v)
}
Sim_Coin(3)
  coin1 coin2 sum
1  tail  tail   0
2  head  head   2
3  tail  head   1
Since your question talked about flipping 4 coins and not just 2, here is an expanded version:
Sim_Coin2 <- function(m) {
  n <- 4  # number of coins to flip
  # create n vectors, each m long
  coins <- lapply(1:n, function(i) {
    sample(0:1, size = m, replace = TRUE)
  })
  # make data frame and rename columns
  dfcoin <- as.data.frame(do.call(cbind, coins))
  names(dfcoin) <- paste0("Coin", 1:n)
  # calculate the number of heads by taking the sum of the rows
  dfcoin$sum <- rowSums(dfcoin)
  dfcoin
}
Sim_Coin2(10)
I've been hitting walls trying to write the results of a loop to a csv. I'm trying to rank data within each of 20 columns. The loop I'm using is:
for (i in 1:ncol(testing_file)) {
print(rank(testing_file[[i]]))
}
This works and prints expected results to screen. I've tried a lot of methods suggested in various discussions to write this result to file or data frame, most with no luck.
I'll just include my most promising lead, which returns only one column of correct data, with a column heading of "testing":
for (i in 1:ncol(testing_file)) {
testing<- (rank(testing_file[[i]]))
testingdf <- as.data.frame(testing)
}
Any help is greatly appreciated!
I found a solution that works:
# Create an empty data frame that the ranked results will go into
testage <- data.frame(matrix(, nrow = 73, ncol = 20))
# Loop that ranks the data within each column
for (i in 1:ncol(testing_file)) {
  testage[i] <- rank(testing_file[[i]])
  print(testage[i])
}
# Take the column names from the original file and apply them to the ranked file
colnames(testage) <- colnames(testing_file)
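The solution above hard-codes 73 rows and 20 columns. A sketch that avoids this, assuming every column of testing_file is numeric: copy the frame and overwrite its columns in place. The toy testing_file, the names ranked and out_file, and the temporary file path are made up for illustration.

```r
# Toy stand-in for testing_file (assumption: all columns are numeric)
testing_file <- data.frame(x = c(10, 30, 20), y = c(3, 1, 2))

# Copy the frame, then overwrite every column with its ranks;
# assigning into ranked[] keeps the dimensions and column names
ranked <- testing_file
ranked[] <- lapply(testing_file, rank)

# Write to a temporary file (hypothetical path) and read it back to check
out_file <- tempfile(fileext = ".csv")
write.csv(ranked, out_file, row.names = FALSE)
read.csv(out_file)
```

Because the result is built from testing_file itself, it adapts to any number of rows and columns, and the column names carry over without a separate colnames step.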
I'm bad with nested loops so I'd try:
testing_file <- data.frame(x = 1:5, y = 15:11)
testing <- as.data.frame(lapply(testing_file, rank))
> testing_file
  x  y
1 1 15
2 2 14
3 3 13
4 4 12
5 5 11
This gets you out of messy nested loops. Did you want to check the results of rank() prior to writing to csv?
Or just wrap it in write.csv; because lapply runs over the data frame's columns directly, the colnames will be the original df colnames:
write.csv(as.data.frame(lapply(testing_file, rank)), "testing.csv", quote = FALSE)
I am a noob R programmer. I have written code that needs to apply a function to a data frame split by a factor. The data frame itself contains 1,324,961 observations, and the variable we use to slice it has 64,376 levels.
The code is as follows:
library("readstata13")
# Reading the Stata Data file into R
bod_fb <- read.dta13("BoD_nonmissing_fb.dta")
gen_fuzzy_blau <- function(bod_sample){
# Here we drop the Variables that are not required in creating the Fuzzy-Blau index
bod_sample <- as.data.frame(bod_sample)
bod_sample$tot_occur <- as.numeric(bod_sample$tot_occur)
bod_sample$caste1_occ <- as.numeric(bod_sample$caste1_occ)
bod_sample$caste2_occ <- as.numeric(bod_sample$caste2_occ)
bod_sample$caste3_occ <- as.numeric(bod_sample$caste3_occ)
bod_sample$caste4_occ <- as.numeric(bod_sample$caste4_occ)
# Calculating the probabilities of a director belonging to a caste
bod_sample$caste1_occ <- (bod_sample$caste1_occ)/(bod_sample$tot_occur)
bod_sample$caste2_occ <- (bod_sample$caste2_occ)/(bod_sample$tot_occur)
bod_sample$caste3_occ <- (bod_sample$caste3_occ)/(bod_sample$tot_occur)
bod_sample$caste4_occ <- (bod_sample$caste4_occ)/(bod_sample$tot_occur)
# Dropping the total occurrences column, as we do not need it anymore
bod_sample$tot_occur<- NULL
# Here we replace all the blanks with NA
bod_sample <- apply(bod_sample, 2, function(x) gsub("^$|^ $", NA, x))
bod_sample <- as.data.frame(bod_sample)
# Here we push all the NAs in the caste names and caste probabilities to the end of the row
# So if there are only two castes against a name, then they become caste1 and caste2
caste_list<-data.frame(bod_sample$caste1,bod_sample$caste2,bod_sample$caste3,bod_sample$caste4)
caste_list = as.data.frame(t(apply(caste_list,1, function(x) { return(c(x[!is.na(x)],x[is.na(x)]) )} )))
caste_list_prob<-data.frame(bod_sample$caste1_occ,bod_sample$caste2_occ,bod_sample$caste3_occ,bod_sample$caste4_occ)
caste_list_prob = as.data.frame(t(apply(caste_list_prob,1, function(x) { return(c(x[!is.na(x)],x[is.na(x)]) )} )))
# Here we write two functions: 1. gen_castelist
# 2. gen_caste_prob
# gen_castelist: This function takes the row number (serial number of the director)
# and returns the names of all the castes for which he has a non-zero
# probability.
# gen_caste_prob: This function takes the row number (serial number of the director)
# and returns the probability with which he belongs to the caste
#
gen_castelist <- function(x){
y <- caste_list[x,]
y <- as.vector(y[!is.na(y)])
return(y)
}
gen_caste_prob <- function(x){
z <- caste_list_prob[x,]
z <- z[!is.na(z)]
z <- as.numeric(z)
return(z)
}
caste_ls <-list()
caste_prob_ls <- list()
for(i in 1:nrow(bod_sample))
{
caste_ls[[i]]<- gen_castelist(i)
caste_prob_ls[[i]]<- gen_caste_prob(i)
}
gridcaste <- expand.grid(caste_ls)
gridcaste <- data.frame(lapply(gridcaste, as.character), stringsAsFactors=FALSE)
gridcasteprob <- expand.grid(caste_prob_ls)
# Generating the Joint Probability
gridcasteprob$JP <- apply(gridcasteprob,1,prod)
# Generating the Similarity Index
gen_sim_index <- function(x){
x <- t(x)
a <- as.data.frame(table(x))
sim_index <- sum(a$Freq^2)/(sum(a$Freq))^2
return(sim_index)
}
gridcaste$sim_index <- apply(gridcaste,1,gen_sim_index)
# Generating fuzzyblau
gridcaste$fb <- gridcaste$sim_index * gridcasteprob$JP
fuzzy_blau_index <- sum(gridcaste$fb)
remove_list <- c("gridcaste","")
return(fuzzy_blau_index)
}
fuzzy_blau_output <- by(bod_fb,bod_fb$code_year,gen_fuzzy_blau)
# Saving the output as a dataframe with two columns
# Column 1 is the fuzzy blau index
# Column 2 is the code_year
code_year <- names(fuzzy_blau_output)
fuzzy_blau <- as.data.frame(as.vector(unlist(fuzzy_blau_output)))
names(fuzzy_blau) <- c("fuzzy_blau_index")
fuzzy_blau$code_year <- code_year
bod_fb <- merge(bod_fb,fuzzy_blau,by = "code_year")
save.dta13(bod_fb,"bod_fb_example.dta")
If the code is tl;dr, the summary is as follows:
I have a dataframe bod_fb. I need to apply the gen_fuzzy_blau function to this dataframe, slicing it by the levels of bod_fb$code_year.
Because the function does a lot of work, sequential processing takes more than a day and ends up running out of memory. The function gen_fuzzy_blau returns a numeric variable fuzzy_blau_index for each code_year of the dataframe. I use by to apply the function on each slice. I wanted to know if there is a way to run this code in parallel, so that multiple instances of the function run at once on different slices of the dataframe. I did not find a by implementation in the parallel package, and I did not know how to pass the dataframes as iterators when using the foreach and doParallel packages.
I have an AMD A8 laptop with 4GB RAM and Windows 7 SP1 Home Basic. I have set the page file to 20GB (this was after I got the memory error).
Thank you
EDIT 1: @milkmotel I have eliminated the redundancy in the code and removed the for loops, but a huge amount of time is still being spent in gen_sim_index. I am using the proc.time() function to gauge the time that each part of the code takes.
The function is supposed to do the following to a row:
If we have a row (not a vector), say a a b c, the similarity index will be (2/4)^2 + (1/4)^2 + (1/4)^2, i.e. the sum over the unique elements of the row of (number of occurrences of that element / total number of elements in the row)^2.
I am unable to use the apply function directly on the row because each element in the row has different factor levels, and table() does not output the frequencies properly.
What is an efficient way to code the gen_sim_index function?
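One possible way to write it, as a sketch only: convert the row to character first, so that differing factor levels stop mattering, and count with match and tabulate instead of table. The function name sim_index_fast and the toy data frame are assumptions for illustration, not part of the original code.

```r
# Hypothetical replacement for gen_sim_index: works on one row's values
sim_index_fast <- function(x) {
  x <- as.character(x)                     # compare labels, not factor codes
  counts <- tabulate(match(x, unique(x)))  # occurrences of each unique value
  sum(counts^2) / length(x)^2
}

# The example from the question: (2/4)^2 + (1/4)^2 + (1/4)^2 = 0.375
sim_index_fast(c("a", "a", "b", "c"))

# Row-wise use on a toy stand-in for gridcaste (assumption: character columns)
toy <- data.frame(d1 = c("a", "x"), d2 = c("a", "y"),
                  d3 = c("b", "x"), d4 = c("c", "x"),
                  stringsAsFactors = FALSE)
toy_index <- apply(as.matrix(toy), 1, sim_index_fast)
```

Working on a character matrix avoids building a data.frame from table() for every row, which is likely where much of the time goes. Whether this is fast enough for the full grid is untested here.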
You're saving your data 6 times over in 6 different variables. Try not doing that. It also takes a day because you're running character matching with gsub() on a ridiculous amount of data.
Take your code out of your gen_fuzzy_blau function as it provides no value to wrap it up into one function rather than running it all independently. Then run it all one line at a time. If it takes too long to run, reconsider your method. Your code is incredibly inefficient.
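On the parallel part of the question: by has no direct counterpart in the parallel package, but split plus parLapply gives the same "one call per slice" pattern, and a socket cluster also works on Windows, where mclapply cannot fork. The sketch below uses made-up data and a hypothetical stand-in function slice_stat in place of gen_fuzzy_blau. The worker count of 2 is likewise an assumption.

```r
library(parallel)

# Toy stand-in for bod_fb (assumption: code_year identifies each slice)
toy <- data.frame(
  code_year = rep(c("A_2001", "B_2001", "C_2002"), each = 4),
  value = 1:12
)

# Hypothetical stand-in for gen_fuzzy_blau: one number per slice
slice_stat <- function(slice) sum(slice$value)

# One data frame per code_year, like the slices that by() would create
slices <- split(toy, toy$code_year)

cl <- makeCluster(2)                       # socket cluster, usable on Windows
res <- parLapply(cl, slices, slice_stat)   # one call per slice, spread over workers
stopCluster(cl)

out <- data.frame(code_year = names(slices),
                  stat = unlist(res, use.names = FALSE))
out
```

Each worker receives copies of the slices it processes, so with 4GB of RAM the number of workers probably has to stay small. Parallelism will not fix the memory use inside the function itself.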
I have an ODE model in Matlab for which I'm interested in performing some parameter sweeps.
I am trying to port the following code from Matlab to R
for i = 1:numel(sweep1)
    initial_conditions(6) = sweep1(i);
    for j = 1:numel(sweep2)
        parameters(3) = sweep2(j);
        [t,y] = ode23s(@(timespan, initial_conditions) MODEL(timespan, initial_conditions, parameters), timespan, initial_conditions);
        results_cell{i,j} = [y(end,1), y(end,2)];
    end
end
The two for loops above first vary one initial condition (i), then for each i vary a parameter (j) and run the solver. The output from the solver for each iteration of the loop is then collected in the cell array 'results_cell'.
This runs fine in Matlab but I need to port it to R. The loops are the same and the solver code is implemented using deSolve, however I am not sure how to collect the results from the solver at each iteration of the loop as R doesn't have cells like Matlab, and how to gather {i,j} from each loop along with the 2 ode outputs.
Ultimately I would like to plot a heat map of the ode solver output vs the values in each of the 2 parameter sweeps.
Thanks for any help.
Here is what I would do: I run ode23 once to get the structure of the solution.
sweep1 =2
sweep2 =3
library(pracma)
f <- function(t, x,i=1,j=0)
as.matrix(c(x[1] * ((i+j) - x[2]^2) -x[2], x[1]))
t0 <- 0
tf <- 20
x0 <- as.matrix(c(0, 0.25))
sol = ode23(f, t0, tf, x0,1,1)$y
res = tail(sol,1)
Then I use replicate to create the structure of the final output array. This trick saves us from pre-allocating the array by hand; replicate does it for us.
results_cell = replicate(sweep1,replicate(sweep2,res))
I just run my final simulation and assign each solution to results_cell
for (i in seq(sweep1))
  for (j in seq(sweep2))
    results_cell[,,j,i] = tail(ode23(f, t0, tf, x0,i,j)$y,1)
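If you would rather stay close to the Matlab cell array, a list with dimensions behaves much like one: results_cell[[i, j]] plays the role of results_cell{i,j}. A minimal sketch, where fake_solver is a hypothetical stand-in for the real solver call and the sweep values are made up:

```r
sweep1 <- c(1, 2)
sweep2 <- c(10, 20, 30)

# Hypothetical stand-in for the ODE solve: returns two "end state" numbers
fake_solver <- function(ic, p) c(ic + p, ic * p)

# A list with dimensions: one empty slot per (i, j) combination
results_cell <- matrix(vector("list", length(sweep1) * length(sweep2)),
                       nrow = length(sweep1), ncol = length(sweep2))

for (i in seq_along(sweep1)) {
  for (j in seq_along(sweep2)) {
    results_cell[[i, j]] <- fake_solver(sweep1[i], sweep2[j])
  }
}

results_cell[[2, 3]]  # the two outputs for sweep1[2] and sweep2[3]
```

Each slot can hold a vector of any length, so this also works when the solver returns more than two values.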
I'm assuming sweep1 and sweep2 are both vectors of numbers. What you can do is use expand.grid to make a data frame of the combinations of that, and then loop over the frame once with apply:
# sweep 1, sweep 2
sweep1 <- c(1, 2, 4)
sweep2 <- c(3, 5, 7)
# expand out the combinations
combinations <- expand.grid(sweep1=sweep1, sweep2=sweep2)
# apply over the data frame
results <- apply(combinations, 1, function(row) {
  # set up the parameters from the row which has been passed in.
  initial_conditions[6] <- row["sweep1"]
  parameters[3] <- row["sweep2"]
  # call ode23s (placeholder arguments; substitute your own solver call)
  res <- ode23s(initial_conditions, parameters, function, whatever, ...)
  # there should be a nicer way than calling nrow twice here, but R doesn't
  # seem to have the nice 'end' keyword
  # also, we copy in the row, so that's in the output.
  c(row, one = res[nrow(res), 1], two = res[nrow(res), 2])
})
# because the apply has flipped rows to columns...
results <- as.data.frame(t(results))
results
# sweep1 sweep2 one two
# 1 1 3 ... ...
# 2 2 3 ... ...
# ...
The result of all this is a data frame of the input combinations and the output combinations. If you want more factors, add on a sweep3, but beware of the combinatorial complexity...
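For the heat map, the long data frame can be folded back into a matrix with one row per sweep1 value and one column per sweep2 value. The sketch below is self-contained, so it swaps the ODE solve for a hypothetical closed-form stand-in, end_state (logistic growth). That function, its defaults K and t_end, and the names grid and heat are assumptions for illustration.

```r
sweep1 <- c(1, 2, 4)
sweep2 <- c(3, 5, 7)

# Hypothetical stand-in for "solve the ODE and take the final value":
# closed-form logistic growth from y0 with rate r, evaluated at t_end
end_state <- function(y0, r, K = 10, t_end = 1) {
  K / (1 + (K - y0) / y0 * exp(-r * t_end))
}

# One row per combination; expand.grid varies sweep1 fastest
grid <- expand.grid(sweep1 = sweep1, sweep2 = sweep2)
grid$out <- mapply(end_state, y0 = grid$sweep1, r = grid$sweep2)

# Column-major fill matches that ordering: rows = sweep1, columns = sweep2
heat <- matrix(grid$out, nrow = length(sweep1), ncol = length(sweep2),
               dimnames = list(sweep1, sweep2))

# Draw the heat map only in an interactive session
if (interactive()) image(sweep1, sweep2, heat, xlab = "sweep1", ylab = "sweep2")
```

With a real model, only end_state needs replacing, by a function that runs the solver and returns the quantity to plot.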