I have some code which uses a loop to calculate a water balance for each of 8 catchments (watersheds). I would like the loop to write its output to a data frame in R, but the only way I can work out how to do this is to write it to csv (inside the loop), then read each of the csv files separately outside of the loop. I feel as though there could be a better way to do this - any ideas?
This is my code: (note it is part of a shiny app, hence my desire to avoid reading and writing csv files)
WB_catchments <- function (){
for (i in 1:8){
file_name <- gsub(" ", "", paste("outputs\\", Lake_name[i], "_catchment_water_balance.csv"))
p <- pts()[[i]]
Rain_in_WB <- RAIN() %>% filter(Grid_id %in% p)
Rain_in_WB$Grid_id <- NULL #remove Grid_id column
Rain_in_WB <- colSums(Rain_in_WB, na.rm = TRUE) # sum over catchment
AET_out_WB <- AET() %>% filter(Grid_id %in% p)
AET_out_WB$Grid_id <- NULL #remove Grid_id column
AET_out_WB <- -1*colSums(AET_out_WB, na.rm = TRUE) # sum over catchment and multiply by -1 as is an output
Evap_WB <- -1*EVAP_lakes[i,]
SW_in_WB <- SW_in_C[i,]
GW_in_WB <- GW_in_C[i,]
SW_out_WB <- -1*SW_out_C[i,]
GW_out_WB <- -1*GW_out_C[i,]
stor_WB <- STOR[i,]
out_catchment <- -1*outside[i,]
bal <- as.data.frame(cbind(WY, Rain_in_WB, SW_in_WB, GW_in_WB, AET_out_WB, Evap_WB, SW_out_WB, GW_out_WB, stor_WB, out_catchment))
bal <- mutate(bal, "res" = rowSums(bal[,2:10], na.rm = TRUE))
colnames(bal) <- c("WaterYear", "Rain", "SW_in", "GW_in", "AET", "Evap", "SW_out", "GW_out", "Storage", "Water_out_of_Greater_Tarawera_Catchments", "Residual")
write.csv(bal, file_name)
}
}
WB_catchments()
Okareka_WB_C <- read.csv("outputs\\Okareka_catchment_water_balance.csv")
Okaro_WB_C <- read.csv("outputs\\Okaro_catchment_water_balance.csv")
Okataina_WB_C <- read.csv("outputs\\Okataina_catchment_water_balance.csv")
Rerewhakaaitu_WB_C <- read.csv("outputs\\Rerewhakaaitu_catchment_water_balance.csv")
Rotokakahi_WB_C <- read.csv("outputs\\Rotokakahi_catchment_water_balance.csv")
Rotomahana_WB_C <- read.csv("outputs\\Rotomahana_catchment_water_balance.csv")
Tarawera_WB_C <- read.csv("outputs\\Tarawera_catchment_water_balance.csv")
Tikitapu_WB_C <- read.csv("outputs\\Tikitapu_catchment_water_balance.csv")
Instead of posting some very specific code snippets, it is in most cases better to post a toy example. Here is an artificial example of how to fill a data frame in a loop. As R is a vectorized language, it is often possible to avoid a loop altogether. Compare the two cases below:
## number of cases
N <- 10
### looped version =====
df <- data.frame(
rain=rep(0, N),
evap=rep(0, N)
)
for (i in 1:N) {
# instead of runif, do your calculations
# ...
rain <- runif(1, min=0, max=10)
evap <- runif(1, min=1, max=5)
df[i, ] <- c(rain, evap)
}
df
### vectorized version =====
rain <- runif(N, min=0, max=10)
evap <- runif(N, min=1, max=5)
df2 <- data.frame(
rain=rain,
evap=evap
)
df2
If your calculations return more than one row in each iteration and you don't know beforehand how many, grow the data frame like this:
## empty data frame
df3 <- data.frame(
rain=NULL,
evap=NULL
)
for (i in 1:N) {
# instead of runif, do your calculations
# ...
rain <- runif(7, min=0, max=10)
evap <- runif(7, min=1, max=5)
df3 <- rbind(df3, cbind(rain, evap))
}
df3
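Calling rbind inside the loop copies the growing data frame on every iteration, which can become slow as the number of iterations grows. A common alternative, sketched here with the same made-up runif values, is to collect the pieces in a pre-allocated list and bind them once at the end. The names pieces and df4 and the extra iteration column are only for illustration.

```r
N <- 10
set.seed(1)  # only so the toy numbers are reproducible

# pre-allocate a list with one slot per iteration
pieces <- vector("list", N)

for (i in seq_len(N)) {
  # instead of runif, do your calculations
  pieces[[i]] <- data.frame(
    iteration = i,
    rain = runif(7, min = 0, max = 10),
    evap = runif(7, min = 1, max = 5)
  )
}

# bind all pieces in a single step
df4 <- do.call(rbind, pieces)
dim(df4)
```

The iteration column keeps track of which loop pass produced each row, which is often handy once everything is stacked.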
Edit: Create several data frames (as elements of a list)
If separate data frames are needed, it is a good idea to put them together in a list. Instead of a loop, we can use lapply:
create_df <- function(i) {
# optionally: do something with i, e.g. select file name
rain <- runif(7, min=0, max=10)
evap <- runif(7, min=1, max=5)
df <- data.frame(
rain=rain,
evap=evap
)
df # return the data frame explicitly
}
## lapply does the "loop" and returns a list of data frames
df_list <- lapply(1:8, create_df)
df_list[[7]] # returns 7th data frame
Another way I got this working was by using assign(file_name, bal, envir = .GlobalEnv) instead of write.csv(bal, file_name) in the last line of my function
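If writing into the global environment with assign() feels awkward (for example inside a shiny app), a named list gives similar by-name access without creating global objects. The sketch below is only an illustration: lake_names and make_balance are hypothetical stand-ins for Lake_name and the real water balance calculation.

```r
# hypothetical stand-ins for the real lake names and per-catchment calculation
lake_names <- c("Okareka", "Okaro", "Tarawera")

make_balance <- function(i) {
  # placeholder calculation; the real function would build `bal` here
  data.frame(WaterYear = 2001:2003, Rain = i * c(10, 20, 30))
}

# one data frame per catchment, returned as a list instead of assigned globally
wb_list <- lapply(seq_along(lake_names), make_balance)
names(wb_list) <- paste0(lake_names, "_WB_C")

# access a single catchment by name
wb_list[["Okaro_WB_C"]]

# or stack everything into one data frame with an identifying column
wb_all <- do.call(rbind, Map(function(d, nm) cbind(Lake = nm, d), wb_list, lake_names))
rownames(wb_all) <- NULL
wb_all
```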
Hello I am trying to speed up a block of code that is currently working, but is quite slow with the amount of data that I have. I need to identify the top n% highest value in a row and subsequently use this to make an average by subsetting a dataframe and averaging the values of the subset. Any help or suggestions would be appreciated. This is my current approach:
corrMat <- matrix(runif(944*9843), nrow=944, ncol = 9843)
GeneExpression <- matrix(runif(11674*9843, min=0, max=100), nrow = 11674, ncol=9843)
cutOff <- apply(corrMat, MARGIN = 1, FUN = quantile, 0.99)
topCells <- corrMat > cutOff
data <- matrix(, nrow = nrow(topCells), ncol = nrow(GeneExpression))
colnames(data) <- rownames(GeneExpression)
for(i in colnames(data)){
for(j in 1:nrow(topCells)){
data[j,i] <- mean(t(GeneExpression[i, topCells[j,]]))
}
}
data
Here's a smaller version of your example along with my base R solution. Chances are there's also a neat tidyverse way of doing this but I wouldn't know.
corrMat <- matrix(runif(24*18), nrow=24)
GeneExpression <- matrix(runif(36*18, min=0, max=100), nrow = 36)
cutOff <- apply(corrMat, MARGIN = 1, FUN = quantile, 0.99)
topCells <- corrMat > cutOff
data <- data2 <- matrix(, nrow = nrow(topCells), ncol = nrow(GeneExpression))
colnames(data) <- rownames(GeneExpression) # rownames are NULL so this is not needed
start <- Sys.time() # benchmarking
for(i in 1:ncol(data)){ # iterate by column rather than colname
for(j in 1:nrow(topCells)){
data[j,i] <- mean(t(GeneExpression[i, topCells[j,]]))
}
}
eric <- Sys.time() - start
start <- Sys.time()
# apply over rows of topCells to take row means of GeneExpression
# per row of topCells
# then just transpose
data2 <- t(apply(topCells, 1, function(x) rowMeans(GeneExpression[, x, drop = FALSE])))
milan <- Sys.time() - start
all(data == data2)
[1] TRUE
eric
Time difference of 0.08776498 secs
milan
Time difference of 0.02593184 secs
Using your original example data, my solution takes 6.43s to run.
Hope this helps.
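The same averages can also be written as a matrix product, since the mean over the selected cells is the sum over those cells divided by their count. This is a sketch on the small example only; I have not benchmarked it on the full-size data, and the names sums and data3 are made up here.

```r
set.seed(42)
corrMat <- matrix(runif(24 * 18), nrow = 24)
GeneExpression <- matrix(runif(36 * 18, min = 0, max = 100), nrow = 36)
cutOff <- apply(corrMat, MARGIN = 1, FUN = quantile, 0.99)
topCells <- corrMat > cutOff

# reference result, using the apply/rowMeans approach from above
data2 <- t(apply(topCells, 1, function(x) rowMeans(GeneExpression[, x, drop = FALSE])))

# turn the logical mask into 0/1 and sum the selected cells with one matrix product
sums <- (topCells * 1) %*% t(GeneExpression)  # 24 x 36 matrix of sums
# divide each row by its number of selected cells (recycled down the columns)
data3 <- sums / rowSums(topCells)

all.equal(data2, data3)
```

This assumes every row of topCells selects at least one cell; otherwise the division produces NaN for that row.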
I am calculating a community weighted mean of functional trait values (studying forestry). I have to multiply the relative abundances of each species (tree) by the trait values. I have 2 data frames: one with the relative abundances of each species within each site and one with the average trait values for each species. I made a loop to automate the calculation, but the end results return the multiplication 13 times instead of once (I have 13 plots, so maybe it has something to do with this). I have been busy with this script for several days since I'm new to R, but I have to do this for my master's thesis. I think I reached my limit of logical thinking today and can't find my error :) Can someone help me please? I'll paste the script below:
# load data, apply some column names, fill NAs with 0
library(data.table)
traits <- read.csv("Trait value.csv", sep = ";")
plots_Maiz <- read.csv("CWM Maiz plot.csv", sep = ";")
plots_Maiz[is.na(plots_Maiz)] <- 0
colnames(plots_Maiz) <- c("site", "species","y0","y1", "y2", "y3", "y4", "y5")
traits[,1:17][is.na(traits[,1:17])] <- 0
#function for finding the corresponding species for a plot in the traitlist
traitsf <- function(df, traitlist){
plottraits <- subset(traitlist, species %in% df[,2])
return(plottraits)
}
traitcalc <- function(traits, plots_Maiz){
multlist <- list()
blist <- list()
vmult <- vector()
tickcount <- 0
plotsplit <- split.data.frame(plots_Maiz, plots_Maiz$site)
testlist <- lapply(plotsplit, traitsf, traitlist = traits)
for (q in 1:length(plotsplit)){
df1 <- testlist[[q]]
df2 <- plotsplit[[q]]
plot <- as.character(plotsplit[[q]][1,1])
for (i in 1:nrow(df1)){
v <- as.numeric(as.vector(t(df1[i,2:ncol(df1)])))
species <- as.character(df1[i,1])
for (j in 1:(ncol(df2)-2)){
tickcount <- tickcount + 1
vmult <-as.vector(v * (as.numeric(as.vector(df2[i,j+2]))))
vmult <- as.list(c(vmult, j-1, species, plot))
multlist[[tickcount]] <- vmult
}
}
b <- do.call(rbind, multlist)
b <- data.table::rbindlist(multlist)
blist[[q]] <- b
}
return(blist)
}
endresults <- traitcalc(traits,plots_Maiz)
endresultsdf2<- do.call("rbind", endresults)
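One thing that stands out in traitcalc is that multlist and tickcount are created once, before the loop over plots, and are never reset. Each b therefore seems to contain the rows of all plots processed so far, so after the final rbind the first plot would appear once per plot in the data. The toy loop below (made-up names, not the thesis data) reproduces that pattern and shows the effect of resetting the accumulator inside the outer loop.

```r
plots <- c("A", "B", "C")

collect <- function(reset_inside) {
  acc <- list()  # accumulator, playing the role of multlist
  out <- list()  # per-plot results, playing the role of blist
  for (q in seq_along(plots)) {
    if (reset_inside) acc <- list()  # start fresh for every plot
    for (i in 1:2) {
      acc[[length(acc) + 1]] <- data.frame(plot = plots[q], row = i)
    }
    out[[q]] <- do.call(rbind, acc)
  }
  do.call(rbind, out)
}

table(collect(reset_inside = FALSE)$plot)  # earlier plots are repeated
table(collect(reset_inside = TRUE)$plot)   # every plot appears exactly twice
```

If this is indeed the cause, moving the initialisation of multlist and tickcount inside the for (q in ...) loop would be the corresponding change in traitcalc.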
My question is two fold. First, given these three data frames:
df1 <- data.frame(k1 = runif(6, min=0, max=100),
k2 = runif(6, min=0, max=100),
k3 = runif(6, min=0, max=100),
k4 = runif(6, min=0, max=100))
df2 <- data.frame(k1 = runif(6, min=0, max=100),
k2 = runif(6, min=0, max=100),
k3 = runif(6, min=0, max=100),
k4 = runif(6, min=0, max=100))
df3 <- data.frame(k1 = runif(6, min=0, max=100),
k2 = runif(6, min=0, max=100),
k3 = runif(6, min=0, max=100),
k4 = runif(6, min=0, max=100))
I would like to reformat and rename part of each data frame using this function:
samplelist<-c("k2", "k4")
draft_fxn<-function(x, obj_name){
x.selected<-x[,c(samplelist)] #select columns of choice
colnames(x.selected)[1:2]<-paste(obj_name, colnames(x.selected), sep="_") #rename columns so they include original data frame name
return(x.selected)
}
#Example run and output:
df2_final<-draft_fxn(df2, "df2")
#output from:
head(df2_final[1:2],)
> df2_k2 df2_k4
>1 5.240274 53.03423
>2 5.042926 34.78974
First question: How can I change my function so I don't have to type in ' df2, "df2" '. In my draft_fxn code, I want to replace "obj_name" with whatever the name of the input data frame is. In my example it is "df2".
Second question: How can I loop through all of my data frames? Perhaps, similar to this for loop?
objs<-c(df1, df2, df3)
for (file in objs){
out<-draft_fxn(file); return(out)
} #this doesn't work though.
To answer your first question: you can obtain the name of an object x using deparse(substitute(x)). So to eliminate the argument obj_name from your function, you could use
draft_fxn <- function(x){
obj_name <- deparse(substitute(x))
x.selected<-x[,c(samplelist)]
colnames(x.selected)[1:2]<-paste(obj_name, colnames(x.selected), sep="_") #rename columns so they include original data frame name
return(x.selected)
}
As to your second question, if you wanted to perform such an operation for multiple data frames, you would usually put them in a list and then lapply the function. In this case, however, it does not work, because the object name changes if you put the data frames into a list, i.e. deparse(substitute(x)) returns X[[i]] instead of the name of the individual data frame. If you wanted to do it in a loop I would suggest a different approach where you pass a vector of the names of the data frames:
## Names of the relevant data frames:
objNames <- c("df1", "df2", "df3")
## Function to rename the specified columns:
renameFun <- function(xString){
x <- get(xString)[,c(samplelist)]
colnames(x) <- paste(xString, samplelist, sep = "_")
x
}
## Apply function to all data frames specified by objNames:
lapply(objNames, renameFun)
# [[1]]
# df1_k2 df1_k4
# 1 54.232123 2.178375
# 2 16.816784 23.586760
# 3 6.612874 16.509340
# 4 92.399588 71.133637
# 5 22.917838 8.127079
# 6 43.563411 21.118758
#
# ...
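A variation on the same idea, sketched under the assumption that the data frames can be collected in a named list: Map passes each data frame together with its name, so neither get() nor deparse(substitute()) is needed. The names dfs and renamed are made up for this example.

```r
set.seed(1)
df1 <- data.frame(k1 = runif(6), k2 = runif(6), k3 = runif(6), k4 = runif(6))
df2 <- data.frame(k1 = runif(6), k2 = runif(6), k3 = runif(6), k4 = runif(6))
df3 <- data.frame(k1 = runif(6), k2 = runif(6), k3 = runif(6), k4 = runif(6))
samplelist <- c("k2", "k4")

# a named list keeps each data frame together with its name
dfs <- list(df1 = df1, df2 = df2, df3 = df3)

# Map() hands the function one data frame and its name at a time
renamed <- Map(function(d, nm) {
  out <- d[, samplelist, drop = FALSE]
  colnames(out) <- paste(nm, samplelist, sep = "_")
  out
}, dfs, names(dfs))

names(renamed)
colnames(renamed$df2)
```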
So your function is not well-specified because you're defining samplelist outside of the function and then calling it inside. The problem with that is that if you don't have samplelist defined, the function will return an error, i.e. it's not self-contained.
Here's an alternative:
draft_fxn<-function(x, cols){
x.selected<-data.frame(x[, cols]) #select columns of choice
colnames(x.selected)<-paste(deparse(substitute(x)), colnames(x.selected), sep="_") #rename columns so they include original data frame name
return(x.selected)
}
Note that the cols argument can vary: it can be a vector of column names or of column positions (positions must be positive and not larger than the number of columns in your data frame).
This returns:
> df2_final<- draft_fxn(df2, cols = c("k2", "k4"))
> head(df2_final)[1:2,]
df2_k2 df2_k4
1 21.62533 2.256182
2 64.83556 67.705705
I am doing systematic calculations for my created dataframe. I have the code for the calculations but I would like to:
1) Write it as a function and call it for the dataframe I created.
2) reset the calculations for next ID in the dataframe.
I would appreciate your help and advice on this.
The dataframe is created in R using the following code:
#Create a dataframe
dosetimes <- c(0,6,12,18)
df <- data.frame("ID"=1,"TIME"=sort(unique(c(seq(0,30,1),dosetimes))),"AMT"=0,"A1"=NA,"WT"=NA)
doserows <- subset(df, TIME%in%dosetimes)
doserows$AMT[doserows$TIME==dosetimes[1]] <- 100
doserows$AMT[doserows$TIME==dosetimes[2]] <- 100
doserows$AMT[doserows$TIME==dosetimes[3]] <- 100
doserows$AMT[doserows$TIME==dosetimes[4]] <- 100
#Add back dose information
df <- rbind(df,doserows)
df <- df[order(df$TIME,-df$AMT),]
df <- subset(df, (TIME==0 & AMT==0)==F)
df$A1[(df$TIME==0)] <- df$AMT[(df$TIME ==0)]
#Time-dependent covariate
df$WT <- 70
df$WT[df$TIME >= 12] <- 120
#The calculations are done in a for-loop. Here is the code for it:
#values needed for the calculation
C <- 2
V <- 10
k <- C/V
#I would like this part to be written as a function
for(i in 2:nrow(df))
{
t <- df$TIME[i]-df$TIME[i-1]
A1last <- df$A1[i-1]
df$A1[i] = df$AMT[i]+ A1last*exp(-t*k)
}
head(df)
plot(A1~TIME, data=df, type="b", col="blue", ylim=c(0,150))
The other thing is that the previous code assumes the subject ID=1 for all time points. Suppose instead that the subject ID changes to 2 when the WT (weight) changes to 120. How can I reset the calculations and make it automated for all subject IDs in the dataframe? In this case the original dataframe would be like this:
#code:
rm(list=ls(all=TRUE))
dosetimes <- c(0,6,12,18)
df <- data.frame("ID"=1,"TIME"=sort(unique(c(seq(0,30,1),dosetimes))),"AMT"=0,"A1"=NA,"WT"=NA)
doserows <- subset(df, TIME%in%dosetimes)
doserows$AMT[doserows$TIME==dosetimes[1]] <- 100
doserows$AMT[doserows$TIME==dosetimes[2]] <- 100
doserows$AMT[doserows$TIME==dosetimes[3]] <- 100
doserows$AMT[doserows$TIME==dosetimes[4]] <- 100
df <- rbind(df,doserows)
df <- df[order(df$TIME,-df$AMT),]
df <- subset(df, (TIME==0 & AMT==0)==F)
df$A1[(df$TIME==0)] <- df$AMT[(df$TIME ==0)]
df$WT <- 70
df$WT[df$TIME >= 12] <- 120
df$ID[(df$WT>=120)==T] <- 2
df$TIME[df$ID==2] <- c(seq(0,20,1))
Thank you in advance!
In general, when doing calculations on different subjects' data, I like to split the dataframe by ID, pass the resulting list of per-subject data frames into a for loop, do all the calculations, store each subject's newly calculated data in a list, and then collapse that list back into one dataframe with all the numbers you want. This allows for a lot of control over what you do for each subject.
subjects = split(df, df$ID)
forResults = vector("list", length=length(subjects))
# initialize these constants
C <- 2
V <- 10
k <- C/V
myFunc = function(data, resultsArray){
# the loop index is called s so that it does not mask the rate constant k defined above
for(s in seq_along(data)){
df = data[[s]]
df$A1 = 100 # I assume this should be 100 for t=0 for each subject?
# you could vectorize this nested for loop..
for(i in 2:nrow(df)) {
t <- df$TIME[i]-df$TIME[i-1]
A1last <- df$A1[i-1]
df$A1[i] = df$AMT[i]+ A1last*exp(-t*k)
}
head(df)
# you can add all sorts of other calculations you want to do on each subject's data
# when you're done doing calculations, put the resultant into
# the resultsArray and we'll rebuild the dataframe with all the new variables
resultsArray[[s]] = df
# if you're not using RStudio, then you want to use dev.new() to instantiate a new plot canvas
# dev.new() # dont need this if you're using RStudio (which doesnt allow multiple plots open)
plot(A1~TIME, data=df, type="b", col="blue", ylim=c(0,150))
}
# collapse the results vector into a dataframe
resultsDF = do.call(rbind, resultsArray)
return(resultsDF)
}
results = myFunc(subjects, forResults)
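The same split-calculate-combine pattern can be written more compactly with lapply. This is a minimal sketch on a made-up two-subject data frame, not the data from the question; calc_A1 and its rate argument are hypothetical names, and it assumes A1 starts at the first AMT value of each subject.

```r
# hypothetical two-subject data: one dose of 100 at TIME 0, then decay
df <- data.frame(
  ID   = rep(1:2, each = 4),
  TIME = rep(0:3, times = 2),
  AMT  = rep(c(100, 0, 0, 0), times = 2)
)

# per-subject calculation; rate is passed in rather than read from a global
calc_A1 <- function(d, rate) {
  d$A1 <- NA_real_
  d$A1[1] <- d$AMT[1]  # assumption: start from the first dose
  for (i in seq_len(nrow(d))[-1]) {
    dt <- d$TIME[i] - d$TIME[i - 1]
    d$A1[i] <- d$AMT[i] + d$A1[i - 1] * exp(-dt * rate)
  }
  d
}

C <- 2
V <- 10
# split by subject, apply the calculation, and stack the results again
results <- do.call(rbind, lapply(split(df, df$ID), calc_A1, rate = C / V))
rownames(results) <- NULL
results
```

Because each subject is processed separately, the calculation restarts automatically at the first row of every ID.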
Do you want this:
ddf <- data.frame("ID"=1,"TIME"=sort(unique(c(seq(0,30,1),dosetimes))),"AMT"=0,"A1"=NA,"WT"=NA)
myfn = function(df){
dosetimes <- c(0,6,12,18)
doserows <- subset(df, TIME%in%dosetimes)
doserows$AMT[doserows$TIME==dosetimes[1]] <- 100
doserows$AMT[doserows$TIME==dosetimes[2]] <- 100
doserows$AMT[doserows$TIME==dosetimes[3]] <- 100
doserows$AMT[doserows$TIME==dosetimes[4]] <- 100
#Add back dose information
df <- rbind(df,doserows)
df <- df[order(df$TIME,-df$AMT),]
df <- subset(df, (TIME==0 & AMT==0)==F)
df$A1[(df$TIME==0)] <- df$AMT[(df$TIME ==0)]
#Time-dependent covariate
df$WT <- 70
df$WT[df$TIME >= 12] <- 120
#The calculations are done in a for-loop. Here is the code for it:
#values needed for the calculation
C <- 2
V <- 10
k <- C/V
#I would like this part to be written as a function
for(i in 2:nrow(df))
{
t <- df$TIME[i]-df$TIME[i-1]
A1last <- df$A1[i-1]
df$A1[i] = df$AMT[i]+ A1last*exp(-t*k)
}
head(df)
plot(A1~TIME, data=df, type="b", col="blue", ylim=c(0,150))
}
myfn(ddf)
For multiple calls:
for(i in unique(ddf$ID)) { # loop over the subject IDs present in the data
myfn(ddf[ddf$ID==i,])
readline(prompt="Press <Enter> to continue...")
}
I am experimenting with R and would like to implement a loop which runs 1000000 times and creates a vector of length 10 and adds each vector to a data frame under the name cycle and the number it has iterated.
This is my current code:
loser <- 100
winner <- 500
percentageWinner <- 70
runns <- 1000000
numbs <- 10
for(i in runns ) {
randNumb <- runif(numbs, min=0, max=100)
outcome <- ifelse(randNumb < percentageWinner, winner, loser) # true are winners and false are losers
df <- data.frame(outcome)
colnames(df)[which(names(df) == "outcome")] <- paste("cycle",i)
}
df
I am struggling to add the vector next to the other data.frame column.
Any suggestions, how to do that?
I appreciate your replies!
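In your code, for(i in runns) iterates over the single value runns, so the loop body runs only once, with i equal to runns. If you additionally overwrite i inside the loop (i <- 1), every column gets the same name. Also, df is recreated on each pass, so nothing accumulates.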
You need to change your code for something like:
loser <- 100
winner <- 500
percentageWinner <- 70
runns <- 1000000
numbs <- 10
outcome <- matrix(NA, numbs, runns)
for(i in seq_len(runns)) {
randNumb <- runif(numbs, min=0, max=100)
outcome[,i] <- ifelse(randNumb < percentageWinner, winner, loser)
}
df <- data.frame(outcome)
colnames(df) <- paste0("cycle",seq_len(runns))
Or you can avoid the loop:
randNumb <- runif(numbs*runns, min=0, max=100)
outcome <- ifelse(randNumb < percentageWinner, winner, loser)
outcome <- matrix(outcome, numbs, runns)
df <- data.frame(outcome)
colnames(df) <- paste0("cycle",seq_len(runns))
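As a further variation (a different technique from the runif/ifelse code above), the outcomes can be drawn directly with sample() and its prob argument. This sketch keeps runns small so it runs quickly; the other values are taken from the question.

```r
loser <- 100
winner <- 500
percentageWinner <- 70
runns <- 1000  # kept small here; the question uses 1000000
numbs <- 10

set.seed(123)  # only for reproducibility of the toy run
# draw winner with probability 0.7 and loser with probability 0.3
outcome <- sample(c(winner, loser), size = numbs * runns, replace = TRUE,
                  prob = c(percentageWinner, 100 - percentageWinner) / 100)

# reshape into one column per cycle
df <- as.data.frame(matrix(outcome, nrow = numbs, ncol = runns))
colnames(df) <- paste0("cycle", seq_len(runns))

dim(df)
mean(outcome == winner)  # should be close to 0.7
```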