I am a relatively new R programmer and have written a script that takes some statistical results and ultimately compares them to a vector of results in which the target variable has been randomized. The result vector contains the statistical results of n simulations. As the number of simulations increases (I would like to run at least 10,000), the run time becomes longer than I would like. I have improved the performance in the ways I know how, but would love help from others in optimizing it. The relevant part of the code is below.
#CREATE DATA
require(plyr)
Simulations <- 10001
Variation <- c("Control", "A", "B","C")
Trials <- c(727,724,723,720)
NonResponse <- c(692,669,679,682)
Response <- c(35,55,44,38)
ConfLevel <- .95
#PERFORM INITIAL CALCS
NonResponse <- Trials-Response
Data <- data.frame(Variation, NonResponse, Response, Trials)
total <- ddply(Data,.(Variation),function(x){data.frame(value = rep(c(0,1),times = c(x$NonResponse,x$Response)))})
total <- total[sample(1:nrow(total)), ]
colnames(total) <- c("Variation","Response")
#CREATE FUNCTION TO PERFORM SIMULATIONS
targetshuffle <- function(x)
{
shuffle_target <- x[,"Response"]
shuffle_target <- data.frame(sample(shuffle_target))
revised <- cbind(x[,"Variation"], shuffle_target)
colnames(revised) <- c("Variation","Yes")
yes_variation <- data.frame(table(revised$Yes,revised$Variation))
colnames(yes_variation) <- c("Yes","Variation","Shuffled_Response")
Shuffled_Data <- subset(yes_variation, yes_variation$Yes==1)
Shuffled_Data <- Shuffled_Data[match(Variation, Shuffled_Data$Variation),]
yes_variation <- cbind(Data,Shuffled_Data)
VectorPTest_All <- yes_variation[,c("Variation","NonResponse","Response","Trials","Shuffled_Response")]
Control_Only <- yes_variation[yes_variation$Variation=="Control",]
VectorPTest_Chall <- subset(yes_variation,!(Variation=="Control"))
VectorPTest_Chall <- VectorPTest_Chall[,c("Variation","NonResponse","Response","Trials","Shuffled_Response")]
ControlResponse <- Control_Only$Response
ControlResponseRevised <- Control_Only$Shuffled_Response
ControlTotal <- Control_Only$Trials
VariationCount <- length(VectorPTest_Chall$Variation)
VP <- data.frame(c(VectorPTest_Chall,rep(ControlResponse),rep(ControlResponseRevised),rep(ControlTotal)))
names(VP) <- c("Variation","NonResponse","Response", "Trials", "ResponseShuffled", "ControlReponse",
"ControlResponseShuffled","ControlTotal")
VP1 <<- data.frame(VP[,c(5,7,4,8)])
VP2 <<- data.frame(VP[,c(3,6,4,8)])
ptest <- apply(VP1, 1, function(column) prop.test(x=c(column[1], column[2]),
n=c(column[3], column[4]), alternative="two.sided",
conf.level=ConfLevel, correct=FALSE)$p.value)
min_p_value <- min(ptest)
return(min_p_value)
}
#CALL FUNCTION
sim_result <- do.call(rbind, rlply(Simulations, targetshuffle(total)))
Offhand, one thing to look at is all the data frame creation. Each time you create one, you copy all the data in the constituent objects. If the dimensions are predictable, you might consider creating empty matrices at the beginning of the function and populating them as you go.
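For instance, since targetshuffle() returns a single p-value per call, a minimal sketch of the pre-allocation idea (an alternative to the rlply/rbind call above, not a tested drop-in) might be:
#Pre-allocate the result vector once, then fill it in place,
#instead of building and binding intermediate objects per simulation.
sim_result <- numeric(Simulations)
for (i in seq_len(Simulations)) {
  sim_result[i] <- targetshuffle(total)  #each call returns min_p_value
}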
So I have sampled a set of lakes at x timepoints throughout the year. I have also deployed loggers etc. in the water, and I want to use daily averages from these loggers over the window from the timepoint of the visit back to x days/hours before it. Sometimes I also just grab a sample at the timepoint of the visit.
This is my solution; it works just fine, but since I experiment a lot with model assumptions and perform sensitivity analyses, it runs unsatisfactorily slowly.
I seem to solve most of my R problems with loops, and I often come across more efficient scripts, so it would be very interesting to see some more effective alternatives to my code.
The code below just generates some dummy data.
library(dplyr)
library(lubridate)
do.pct.sat <- function(x,y,z){
t <- x
do <- y
p <- z
atm <- (p*100)/101325
do.sat <- atm*exp(-139.34411+157570.1/(t+273.15)-66423080/(t+273.15)^2+12438000000/(t+273.15)^3-862194900000/(t+273.15)^4)
do.pct.sat <- (do/do.sat)*100
return(do.pct.sat)
}#function for calculating the % oxygen saturation
#here's some dummy data resembling real data
date.initial <- as.POSIXct("2022-06-01")#deployment date
date.end <- as.POSIXct("2022-10-01")#date of retrieval
id <- c("a","b","c")#lake id
lake <- list()#make dataset list for each lake
s <- list()#list of dataframes for the samples from the lake logger timelines
#loop below generates dummy data. this is not part of the real script that I want to improve.
for(i in 1:3){
datetime <- seq(from = date.initial,to = date.end,by=10*60)#10 minute intervals from deploy to retrieve
l <- length(datetime)#vector length of datetime
#set dummy data
do <- rnorm(l,mean = 10,sd=3)#o2 conc.
pressure <- rnorm(l,mean = 980,sd=50)#baro pressure
temp <- rnorm(l,mean=15,sd=5)#water temp
k.z <- rnorm(l,mean=0.35,sd=0.1)#gas exchange koeff / mixed layer depth
dosat.pct <- do.pct.sat(temp,do,pressure)#oxygen sat in %
iso <- as.data.frame(cbind(datetime,do,dosat.pct,temp,pressure,k.z))#bind dummy dataframe to resemble real data
iso$datetime <- as.POSIXct(iso$datetime,origin = "1970-01-01")
lake[[i]] <- iso#save the data frame to the lake logger list
samples <- as.POSIXct(sample((date.initial+5*24*60*60):date.end, 7, replace=FALSE),origin = "1970-01-01")#randomize 7 timepoints
s[[i]] <- as.data.frame(samples)#save it in empty data frame
s[[i]]$lake <- id[i]
}
names(lake) <- id
samples <- bind_rows(s)
samples$samples <- round_date(samples$samples,unit="10 minutes")#rounds my random samples to closest 10 minute
Below is the function that I want to speed up (same libraries). I think it is slow because I process one date at a time before moving on to the next:
sample.lakes <- function(average=3){
dts <- list()#empty list
for(i in 1:length(lake)){
print(id[i])
data = lake[[i]]
y <- samples[grepl(id[i],samples$lake),]
dates <- y$samples
#empty vectors to fill with values sampled in loop
avg.kz <- vector()
sd.kz <- vector()
do.mgl <- vector()
dosat.pct <- vector()
temp.c <- vector()
for (k in 1:length(dates)){
print(k)
#below I filter the logger data to the window from the sampling timepoint minus the number of days given by 'average', up to the sampling timepoint.
prior.days = filter(data, datetime > as.POSIXct(dates[k])-(24*60*60)*average & datetime < as.POSIXct(dates[k]))
#fill the empty vectors with value I desire, mean and sd k.z and point sample of the other variables.
avg.kz[k] = mean(prior.days$k.z)
sd.kz[k] = sd(prior.days$k.z)
temp.c[k] <- data[grepl(dates[k],data$datetime),]$temp
do.mgl[k] <- data[grepl(dates[k],data$datetime),]$do
dosat.pct[k] <- data[grepl(dates[k],data$datetime),]$dosat.pct
}
sd.kz[is.na(sd.kz)] <- 0
#add them to data frame y
y$dosat.pct <- dosat.pct
y$do.mgl <- do.mgl
y$temp.c <- temp.c
y$avg.kz <- avg.kz
y$sd.kz <- sd.kz
dts[[i]] <- y#add this lake's dataframe to the list
}
iso <- bind_rows(dts)#make a complete dataframe with samples.
return(iso)
}
iso <- sample.lakes(average=4)#do not set average to > 5 in this example script
I would appreciate any suggestions a lot!
My guess is that this part using grepl:
data[grepl(dates[k],data$datetime),]
inside your inner for loop is slow.
Couldn't you instead just check whether the datetimes are equal with ==?
In addition, you only need to subset data once.
Try this as an alternative:
for (k in 1:length(dates)){
print(k)
prior.days = filter(data, datetime > as.POSIXct(dates[k])-(24*60*60)*average & datetime < as.POSIXct(dates[k]))
avg.kz[k] = mean(prior.days$k.z)
sd.kz[k] = sd(prior.days$k.z)
sub_data <- data[data$datetime == dates[k], ]
temp.c[k] <- sub_data$temp
do.mgl[k] <- sub_data$do
dosat.pct[k] <- sub_data$dosat.pct
}
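Going a step further, here is a sketch (assuming each rounded sample timepoint matches exactly one logger row, as in the dummy data): match() can pull all three point samples at once, so only the rolling-window averages still need the loop.
idx <- match(dates, data$datetime)  #row index of each sample timepoint
temp.c <- data$temp[idx]
do.mgl <- data$do[idx]
dosat.pct <- data$dosat.pct[idx]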
Consider the first example, which calculates the running means inside the loop:
st <- Sys.time() #Starting Time
set.seed(123456789)
vara <- c()
sda <- c()
mvara <- c() #store running mean of variances
msda <- c() #store running mean of standard deviations
K <- 100000
for(i in 1:K) {
a <- rnorm(30)
vara[i] <- var(a)
sda[i] <- sd(a)
mvara[i] <- mean(vara) #running mean of the variances so far
msda[i] <- mean(sda) #running mean of the SDs so far
}
et <- Sys.time()
et-st #time taken by the code (roughly a minute or more)
Now consider the same code, except that the running means are calculated outside the loop.
st <- Sys.time() #Starting Time
set.seed(123456789)
vara <- c()
sda <- c()
K <- 100000
for(i in 1:K) {
a <- rnorm(30)
vara[i] <- var(a)
sda[i] <- sd(a)
}
mvara <- cumsum(vara)/ (1:K)
msd <- cumsum(sda)/ (1:K)
et <- Sys.time() #less than 5 seconds
I just wanted to know why there is such a big difference in the performance of the two versions. What should one watch out for when using loops?
R is fastest when you use its internal optimized code to execute loops. My understanding of the reasons behind that is poor (the thread in the comment above has explanations from more knowledgeable people), but I believe some of it has to do with memory pre-allocation, and some with the way it transforms the problem into more efficient pieces.
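As a toy illustration of the pre-allocation point (timings will vary by machine):
K <- 100000
x <- c()                      #grown element by element: repeated reallocation
for (i in 1:K) x[i] <- i^2
y <- numeric(K)               #allocated once up front
for (i in 1:K) y[i] <- i^2    #same work, noticeably faster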
Your code "outside the loop" could be made yet about 20x faster (on my system, went from 7.17 sec to 0.43 sec) by creating all your random numbers first, and then solving the whole table at once, instead of swapping between those two tasks in your loop. And that's using dplyr; I presume a data.table solution could be another 5-10x faster, especially given the large number of groups.
library(dplyr)
set.seed(123456789)
K <- 100000
n <- 30
a_df <- data.frame(trial = rep(1:K, each = n),
                   val = rnorm(K * n))
results <- a_df %>%
group_by(trial) %>%
summarize(vara = var(val),
sda = sd(val)) %>%
mutate(mvara = cumsum(vara) / trial,
msd = cumsum(sda) / trial)
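For what it's worth, an untested data.table sketch of the same computation:
library(data.table)
set.seed(123456789)
K <- 100000
n <- 30
dt <- data.table(trial = rep(1:K, each = n), val = rnorm(K * n))
results <- dt[, .(vara = var(val), sda = sd(val)), by = trial]  #per-trial stats
results[, `:=`(mvara = cumsum(vara) / trial, msd = cumsum(sda) / trial)]  #running means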
The data is:
name <- c("Gen1","Gen2","Gen3")
QuantityE <- c(200,100,50)
PriceE <- c(10,12,50)
QuantityAS <- c(100,50,10)
PriceAS <- c(1,5,7)
mydata <- data.frame(name, QuantityE, PriceE , QuantityAS,PriceAS )
I have the following objective function:
Minimize total cost, where cost is the sum over generators of
(PriceE * QuantityE) + (PriceAS * QuantityAS)
Subject to constraints:
Total QuantityE = 300
Total QuantityAS = 0.06 * Total QuantityE
What is the best approach to use, and what can I read up on to solve this problem?
For completeness: after some reading, I found the right way to code the LP.
There are neater ways of doing it, but this works for me.
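For reference, my reading of the LP encoded below, in compact form (q^E_i and q^{AS}_i are the decision quantities for generator i; the max() row reflects the pmax bound in the constraint matrix):
\begin{aligned}
\min \ & \sum_i \big(\text{PriceE}_i\, q^E_i + \text{PriceAS}_i\, q^{AS}_i\big) \\
\text{s.t. } & q^E_i \le \text{QuantityE}_i,\quad q^{AS}_i \le \text{QuantityAS}_i, \\
& q^E_i + q^{AS}_i \le \max(\text{QuantityE}_i, \text{QuantityAS}_i), \\
& \sum_i q^E_i = 300,\quad \sum_i q^{AS}_i = 0.06 \times 300.
\end{aligned}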
name <- c("Gen1","Gen2","Gen3")
QuantityE <- c(200,100,50)
PriceE <- c(10,12,50)
QuantityAS <- c(100,50,10)
PriceAS <- c(1,5,7)
mydata <- data.frame(name, QuantityE, PriceE , QuantityAS,PriceAS )
#System Data
EnergyDemand <- 300 #Total QuantityE
CRRequired <- 0.06*EnergyDemand #Total Quantity AS
library(lpSolve)
#Set up Objective function, prices will be the co-ef's
obj.fun <- as.vector(stack(mydata[,c(3,5)])[1])
##Set up the constraints matrix
#This will set up individual quantityE and quantityAS coef's
D <- diag(1, NROW(obj.fun),NROW(obj.fun))
#This sets up coefficients with the ability to combine QuantityAS and QuantityE
E <- diag(1, NROW(name),NROW(name))
FA <- cbind(E,E)
#This sets up the coefficients for all quantityE
G <- matrix(c(rep(1,NROW(name)),rep(0,NROW(name))),1)
#This sets up the coefficients for all quantityAS
H <- matrix(c(rep(0,NROW(name)),rep(1,NROW(name))),1)
#This combines the above constraints into one matrix
constr <- rbind(D,FA,G,H)
#Set up directional constraints. All except the last 2 are <=
#This allows flexibility in choosing volumes
# The last two have to be equal to for Energy and AS demand
constr.dir <- c(rep("<=",NROW(constr)-2), rep("=",2))
#This sets up the rhs numbers for the matrix above
rhs <- c(QuantityE, QuantityAS, pmax(QuantityE, QuantityAS), EnergyDemand,CRRequired)
#This solves the LP (compute.sens = TRUE also returns sensitivity information)
prod.sol <- lp("min", obj.fun, constr, constr.dir, rhs, compute.sens = TRUE)
a <- matrix(prod.sol$solution, nrow= length(name)) #decision variables values
rownames(a) <- name
colnames(a) <- c("Energy MW", "AS MW")
#This is the Summary of results
print(mydata) #This gives the initial dataset
a # This gives the combination of quantity used from Gen's
prod.sol #This gives the optimal minimized cost
EDIT: I found out that the Matrix package does everything I need. Super fast and flexible. Specifically, the relevant functions are
Data <- sparseMatrix(i=Data[,1], j=Data[,2], x=Data[,3])
or simply
Data <- Matrix(data=Data,sparse=T)
Once you have your matrix in this Matrix class, everything should work smoothly like a regular matrix (for the most part, anyway).
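As a sketch of how that applies to the cosine problem below (assuming the long-format columns are row index, column index, value, as in the example data):
library(Matrix)
M <- sparseMatrix(i = Data[,1], j = Data[,2], x = Data[,3])
cp <- tcrossprod(M)                  #all pairwise row dot products at once
norms <- sqrt(Matrix::rowSums(M^2))  #row vector lengths
sim <- as.matrix(cp) / outer(norms, norms)  #cosine similarities (dense result; large for 19000 rows)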
======================================================
I have a dataset in "Long format" right now, meaning that it has 3 columns: row name, column name, and value. All of the "missing" row-column pairs are equal to zero.
I need to come up with an efficient way to calculate the cosine similarity (or even just the regular dot product) between all possible pairs of rows. The full data matrix is 19000 x 62000, which is why I need to work with the Long format instead.
I came up with the following method, but it's WAY too slow. Any tips on maximizing efficiency, or any suggestions of a better method overall, would be GREATLY appreciated. Thanks!
Data <- matrix(c(1,1,1,2,2,2,3,3,3,1,2,3,1,2,4,1,4,5,1,2,2,1,1,1,1,3,1),
ncol = 3, byrow = FALSE)
Data <- data.frame(Data)
cosine.sparse <- function(data) {
a <- Sys.time()
colnames(data) <- c('V1', 'V2', 'V3')
nvars <- length(unique(data[,2]))
nrows <- length(unique(data[,1]))
sim <- matrix(nrow=nrows, ncol=nrows)
for (i in 1:nrows) {
data.i <- data[data$V1==i,]
length.i.sq <- sum(data.i$V3^2)
for (j in i:nrows) {
data.j <- data[data$V1==j,]
length.j.sq <- sum(data.j$V3^2)
common.vars <- intersect(data.i$V2, data.j$V2)
row1 <- data.i[data.i$V2 %in% common.vars,3]
row2 <- data.j[data.j$V2 %in% common.vars,3]
cos.sim <- sum(row1*row2)/sqrt(length.i.sq*length.j.sq)
sim[i,j] <- sim[j,i] <- cos.sim
}
if (i %% 500 == 0) {cat(i, " rows have been calculated.")}
}
b <- Sys.time()
time.elapsed <- b - a
print(time.elapsed)
return(sim)
}
cosine.sparse(Data)
I have a big dataframe with 10000 rows and 12 columns (discountdataset).
The columns contain different variables. The first 210 rows represent subject 1 (there is also a column with "subject1"), the next 210 rows represent subject 2, and so on.
I want to use JAGS and a loop to iterate over all 52 subjects in the dataframe and fit a model to each of them. My code looks like this:
#subsetting the dataframe by the variable subjectid
subsetdiscount <- split(discountdataset, as.factor(discountdataset$subjectid))
My plan is to loop over the subsets and fit the following JAGS model to each subject, but it doesn't work. I think my mistake is that the variables "nt", "Choice", etc. that I want to pass on to JAGS are not defined right, or are not updated.
library(rjags)
for (i in 1:length(subsetdiscount))
{
nt <- nrow (subsetdiscount)
Choice <- subsetdiscount$choice
amountSS <- subsetdiscount$val_basic
amountLL <- subsetdiscount$val_d
delayDIFF <- subsetdiscount$delay
con <- subsetdiscount$condition
data <- list("nt", "Choice", "amountSS", "amountLL", "delayDIFF", "con") # to be passed on to JAGS
myinits <- list(
list(k = (c(0.01, 0.01))),
list(temp = (c(6, 6))))
parameters <- c("k", "temp")
samples <- jags(data, inits=myinits, parameters,
model.file ="singlesubmodel_Ben_roundedchoice.txt", n.chains=2, n.iter=20000,
n.burnin=1, n.thin=1, DIC=T)
}
Try:
library(rjags)
library(R2jags)
subsetdiscount <- split(discountdataset, as.factor(discountdataset$subjectid))
output_models <- lapply(subsetdiscount, function(x) {
nt <- nrow(x)
Choice <- x$choice
amountSS <- x$val_basic
amountLL <- x$val_d
delayDIFF <- x$delay
con <- x$condition
data <- list("nt", "Choice", "amountSS", "amountLL", "delayDIFF", "con") # to be passed on to JAGS
myinits <- list(list(k = (c(0.01, 0.01))),
list(temp = (c(6, 6))))
parameters <- c("k", "temp")
samples <- jags(data, inits=myinits, parameters,
model.file ="singlesubmodel_Ben_roundedchoice.txt",
n.chains=2, n.iter=20000,
n.burnin=1, n.thin=1, DIC=T)
return(samples)
})
output_models should be a list containing the output for each of the levels you split the main dataset by.
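If it runs, individual fits can be pulled out of the list by subject, e.g. (assuming the R2jags return structure, where each element carries a BUGSoutput slot):
output_models[[1]]  #fit for the first subject
fit_summaries <- lapply(output_models, function(m) m$BUGSoutput$summary)  #posterior summary per subject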
Please note that it is quite hard to test this without any provided data, so if it fails to work, you may want to provide some sample data for testing.
I hope it helps.