First let me say that I am not an expert coder and any advice about this particular question or my general technique will be greatly appreciated.
I have a large data set that is made up of similar data frames named Table6.#, such as Table6.1, Table6.2, etc. Each data frame has variables that repeat as well, such as ST1_Delta_PV%, ST2_Delta_PV%, etc. and ST1_Reallocation_Margin, ST2_Reallocation_Margin, etc.
I am trying to write several nested loops that will calculate values in each table across these similar variables. I have tried to do this with the paste function as shown below, but this is obviously not the correct way to do it.
for (i in 1:25){
for (j in 1:4){
for (k in 1:length(paste("Table6.",i,"sep="")[,1]){
paste("Table6.",i,sep="")$paste("ST",j,"NonTgt_Shr",sep="")[k] <- paste("Table6.",i,sep="")$paste("ST",j,"_Delta_PV%",sep="")[k] * paste("Table6.",i,sep="")$paste("ST",j,"_Reallocation_Margin",sep="")[k]
}
}
}
I apologize if this is a complete mess. I appreciate your help.
As akrun says, you should put your data frames in a list:
Tables <- list(Table6.1, Table6.2, …)
for (Table in Tables) { … }
This way, you do not need to use paste to construct the different Table names.
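For accessing the different columns, you can use the df[["column"]] syntax (or df[k, "column"] for a single row). This is similar to df$column, except that inside the brackets you can use any string, including one stored in a variable. Note that single brackets with only a name, df["column"], return a one-column data frame rather than the column itself.
nonTgt_Shr.column.name <- paste0("ST",j,"NonTgt_Shr")
delta.column.name <- paste0("ST",j,"_Delta_PV%")
for (k in 1:nrow(Table)) {
Table[k, nonTgt_Shr.column.name] <- Table[k, delta.column.name] * …
}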
Note how I use variables for storing the name, making the line with the actual computation much more readable.
Also, nrow is more intuitive than length(Table[,1]).
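As a minimal sketch of the two ideas above (a list of data frames, and column names built as strings), the following uses two small made-up tables. The data values and the helper name add_shares are assumptions for illustration; the column names follow the pattern in the question. Because arithmetic on columns is vectorised in R, the inner loop over rows is not needed here.

```r
# Toy stand-ins for Table6.1 and Table6.2 (values are made up).
# check.names = FALSE keeps the "%" in the column names.
Table6.1 <- data.frame(
  "ST1_Delta_PV%" = c(0.10, 0.20),
  "ST1_Reallocation_Margin" = c(100, 200),
  check.names = FALSE
)
Table6.2 <- data.frame(
  "ST1_Delta_PV%" = c(0.50, 0.25),
  "ST1_Reallocation_Margin" = c(10, 40),
  check.names = FALSE
)
Tables <- list(Table6.1 = Table6.1, Table6.2 = Table6.2)

# Hypothetical helper: adds the share column for every prefix in `prefixes`.
add_shares <- function(Table, prefixes) {
  for (prefix in prefixes) {
    out.name    <- paste0(prefix, "NonTgt_Shr")
    delta.name  <- paste0(prefix, "_Delta_PV%")
    margin.name <- paste0(prefix, "_Reallocation_Margin")
    # [[ ]] extracts the column as a vector, like $ but with a string
    Table[[out.name]] <- Table[[delta.name]] * Table[[margin.name]]
  }
  Table
}

# lapply returns modified copies, so store them back in the list
Tables <- lapply(Tables, add_shares, prefixes = "ST1")
Tables$Table6.1$ST1NonTgt_Shr
```

Note that a plain for (Table in Tables) loop only changes a copy of each data frame, which is why the sketch assigns the result of lapply back to Tables.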
The calculations could be moved into a function, which improves readability, scaling and robustness.
In the actual calculation function, the function get is used to retrieve the data frame based on the name.
#Calculation Function
fn_CalcVariables <- function(
tableName="Table6.1",
outputVarName="NonTgt_Shr",
inputVarNames=c("_Delta_PV%", "_Reallocation_Margin"),
variablePrefix="ST1"
) {
DF <- get(tableName)
outputVarName <- paste0(variablePrefix, outputVarName)
inputVarNames <- paste0(variablePrefix, inputVarNames)
DF[,outputVarName] <- DF[,inputVarNames[1]] * DF[,inputVarNames[2]]
return(DF)
}
This function should be called by nested lapply calls.
lapply iterates over the elements of its first argument, calls the function (second argument) on each one, and collects the return values in a list.
(As an exercise, try l <- list(a=1, b=2); lapply(l, function(x) { x*2 }).)
#List object names for tables and variable names
tableNamesList <- paste0("Table6.",1:25)
variablePrefixList <- paste0("ST",1:4)
#Nested loops to invoke custom function from above
lapply(variablePrefixList, function(alpha) {
lapply(tableNamesList, function(x, varprefix=alpha) {
cat("Begin Processing Table",x,"varPrefix",varprefix,"\n")
result <- fn_CalcVariables(
tableName=x,
outputVarName="NonTgt_Shr",
inputVarNames=c("_Delta_PV%","_Reallocation_Margin"),
variablePrefix=varprefix
)
cat("End Processing Table", x, "varPrefix", varprefix, "\n")
result #Return the updated data frame; without this line the NULL from cat() is returned
}) #End of inner lapply
}) #End of outer lapply
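One caveat with the approach above: get() hands the function a copy, so the Table6.# objects in the workspace are not changed, and each prefix is computed on a fresh copy of the table. A possible way to keep all results together, sketched here with made-up toy tables and a hypothetical helper named calc_share, is to gather the tables into a named list with mget() and update that list one prefix at a time.

```r
# Toy tables standing in for Table6.1 ... Table6.25 (values are made up)
Table6.1 <- data.frame("ST1_Delta_PV%" = c(1, 2), "ST1_Reallocation_Margin" = c(3, 4),
                       "ST2_Delta_PV%" = c(5, 6), "ST2_Reallocation_Margin" = c(7, 8),
                       check.names = FALSE)
Table6.2 <- Table6.1 * 2

tableNamesList <- paste0("Table6.", 1:2)
variablePrefixList <- paste0("ST", 1:2)

# Collect the tables into one named list
Tables <- mget(tableNamesList)

# Hypothetical helper: same calculation as fn_CalcVariables, but it takes
# the data frame itself instead of its name
calc_share <- function(DF, prefix) {
  DF[[paste0(prefix, "NonTgt_Shr")]] <-
    DF[[paste0(prefix, "_Delta_PV%")]] * DF[[paste0(prefix, "_Reallocation_Margin")]]
  DF
}

# Update every table for every prefix, keeping the results in the list
for (prefix in variablePrefixList) {
  Tables <- lapply(Tables, calc_share, prefix = prefix)
}
names(Tables$Table6.1)
```

After the loop, each element of Tables carries the new columns for all prefixes, and Tables[["Table6.2"]] can be used wherever Table6.2 was used before.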
I am trying to create a vector or list of values based on the output of a function performed on individual elements of a column.
library(hpoPlot)
xyz_hpo <- c("HP:0003698", "HP:0007082", "HP:0006956")
getallancs <- function(hpo_col) {
for (i in 1:length(hpo_col)) {
anc <- get.ancestors(hpo.terms, hpo_col[i])
output <- list()
output[[length(anc) + 1]] <- append(output, anc)
}
return(anc)
}
all_ancs <- getallancs(xyz_hpo)
get.ancestors outputs a character vector of variable length depending on each term. How can I loop through hpo_col adding the length of each ancs vector to the output vector?
Welcome to Stack Overflow :) Great job on providing a minimal reproducible example!
As mentioned in the comments, you need to move the output <- list() outside of your for loop, and return it after the loop. At present it is being reset for each iteration of the loop, which is not what you want. I also think you want to return a vector rather than a list, so I have changed the type of output.
Also, in your original question, you say that you want to return the length of each anc vector in the loop, so I have changed the function to output the length of each iteration, rather than the whole vector.
getallancs <- function(hpo_col) {
output <- numeric()
for (i in 1:length(hpo_col)) {
anc <- get.ancestors(hpo.terms, hpo_col[i])
output <- append(output, length(anc))
}
return(output)
}
If you are only doing this for a few cases, such as your example, this approach will be fine. However, growing a vector inside a loop like this is typically quite slow in R, and it's better to vectorise this style of calculation if possible. This matters most when you are running it over a large number of elements, where the computation will take more than a few seconds.
For example, one way the function above could be vectorised is like so:
all_ancs <- sapply(xyz_hpo, function(x) length(get.ancestors(hpo.terms, x)))
If in fact you did mean to output the whole vector of anc, not just the lengths, the corrected function would look like this:
getallancs <- function(hpo_col) {
output <- character()
for (i in 1:length(hpo_col)) {
anc <- get.ancestors(hpo.terms, hpo_col[i])
output <- c(output, anc)
}
return(output)
}
Or a vectorised version could be
all_ancs <- unlist(lapply(xyz_hpo, function(x) get.ancestors(hpo.terms, x)))
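If it helps to keep the ancestors grouped by term while still getting the counts, a variation along these lines might work. This is only a sketch: fake_ancestors is a hypothetical stand-in for get.ancestors(hpo.terms, x) so that the example runs without hpoPlot, and the ancestor IDs in the lookup table are invented.

```r
# Hypothetical stand-in for get.ancestors(hpo.terms, term): a lookup in a named list.
# The ancestor IDs below are invented for illustration.
fake_ancestry <- list(
  "HP:0003698" = c("HP:0000001", "HP:0000118", "HP:0003698"),
  "HP:0007082" = c("HP:0000001", "HP:0007082"),
  "HP:0006956" = c("HP:0000001", "HP:0000118", "HP:0000707", "HP:0006956")
)
fake_ancestors <- function(term) fake_ancestry[[term]]

xyz_hpo <- c("HP:0003698", "HP:0007082", "HP:0006956")

# Keep the ancestors grouped by term instead of flattening them
ancs_by_term <- lapply(xyz_hpo, fake_ancestors)
names(ancs_by_term) <- xyz_hpo

# lengths() gives the length of each list element
anc_counts <- lengths(ancs_by_term)

# vapply is like sapply, but checks that every result is a single integer
anc_counts2 <- vapply(xyz_hpo, function(x) length(fake_ancestors(x)), integer(1))
```

Both anc_counts and anc_counts2 are integer vectors named by term, so a count can be looked up with anc_counts[["HP:0007082"]].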
Hope that helps. If it solves your problem, please mark this as the answer.
I want to generate a dataframe with summary statistics (AUC, Gini, RMSE, etc.) from validation of multiple models on multiple datasets.
I've got x number of models (classifiers - gbm, xgb, rf, etc. - all built in caret package) that are enclosed in ListOfModels, and y number of datasets (dataframes with identical variables over several data points) that are enclosed in ListOfDatasets.
I can create a short version of the desired dataframe by running a custom function fun_modelStats (that extracts model stats using model and dataset as arguments) inside ldply - but can do so only either over a ListOfModels and just one specific dataset or over a ListOfDatasets and just one specific model, like this:
modelStats_by_model <- ldply(ListOfModels, function(model) {
modelStats <- fun_modelStats(model, B97_2012SU_2013)
})
and
modelStats_by_dataset <- ldply(ListOfDatasets, function(dataset) {
modelStats <- fun_modelStats(gbmFit1, dataset)
})
The resulting dataframe with models' stats has either x or y number of rows, and I can't get my head around the way of building this dataframe with x*y rows, i.e. stats from all models validated on all datasets.
I did experiment with Map and mapply, and for loop, but to no avail.
Using Map I get weird incorrect output:
modelStats_all <- Map(fun_modelStats, ListOfModels, ListOfDatasets)
The for loop does generate the desired output with this code below, but only as plain text in console whereas I need it as a dataframe.
for(i in names(ListOfModels)) {
for(j in names(ListOfDatasets)) {
modelStats <- fun_modelStats(ListOfModels[[i]], ListOfDatasets[[j]])
print(modelStats)
}
}
Many thanks in advance for help!
P.S. Further search at SO (How to write a function that takes a model as an argument in R - this post, for example) shows that using aggregate.formula or aggregate.data.frame or rbind.data.frame could help, but I can't figure out how.
Here is the solution, in case anyone faces a similar problem:
fun_multiModelStats <- function(ListOfModels, ListOfDatasets) {
multiModelStats <- data.frame()
for(i in names(ListOfModels)) {
for(j in names(ListOfDatasets)) {
model <- ListOfModels[[i]]
dataset <- ListOfDatasets[[j]]
modelStats <- fun_modelStats(model, dataset)
modelName <- names(ListOfModels[i])
datasetName <- names(ListOfDatasets[j])
modelStats <- cbind(modelName, datasetName, modelStats)
multiModelStats <- rbind(multiModelStats, modelStats)
}
}
return(multiModelStats)
}
Yet, I would like to find a solution without double for loops but rather with something from the apply family of functions.
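One possible loop-free variant is to enumerate all model/dataset name pairs with expand.grid and walk over them with Map. This is a sketch under stated assumptions: the models, datasets and fun_modelStats below are toy stand-ins (the real fun_modelStats is assumed to return a one-row data frame), and only the overall pattern is meant to carry over.

```r
# Hypothetical stand-ins: "models" are just numbers and "datasets" are numeric vectors
ListOfModels <- list(gbm = 1, rf = 2)
ListOfDatasets <- list(d2012 = c(1, 2, 3), d2013 = c(4, 5, 6), d2014 = c(7, 8, 9))

# Toy replacement for fun_modelStats: returns a one-row data frame of "stats"
fun_modelStats <- function(model, dataset) {
  data.frame(AUC = model * mean(dataset), RMSE = model + max(dataset))
}

# All model/dataset name combinations (x * y rows)
combos <- expand.grid(modelName = names(ListOfModels),
                      datasetName = names(ListOfDatasets),
                      stringsAsFactors = FALSE)

# Map walks down the two name columns in parallel, one combination per call
statsList <- Map(function(m, d) {
  cbind(modelName = m, datasetName = d,
        fun_modelStats(ListOfModels[[m]], ListOfDatasets[[d]]))
}, combos$modelName, combos$datasetName)

# Bind the one-row data frames into a single data frame
multiModelStats <- do.call(rbind, unname(statsList))
rownames(multiModelStats) <- NULL
multiModelStats
```

Binding once at the end with do.call(rbind, ...) also avoids growing the data frame on every iteration, which the double loop does.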
What happens if you modify your for loop as follows:
modelStats <- data.frame()
for(i in names(ListOfModels)) {
for(j in names(ListOfDatasets)) {
modelStats[i,j] <- fun_modelStats(ListOfModels[[i]], ListOfDatasets[[j]])
print(modelStats)
}
}
I'm using lapply to loop through a list of dataframes and apply the same set of functions. This works fine when lapply has just one function, but I'm struggling to see how I store/print the output from multiple functions - in that case, I seem to only get output from one 'loop'.
So this:
output <- lapply(dflis,function(lismember) vss(ISEQData,n=9,rotate="oblimin",diagonal=F,fm="ml"))
works, while the following doesn't:
output <- lapply(dflis,function(lismember){
outputvss <- vss(lismember,n=9,rotate="oblimin",diagonal=F,fm="ml")
nefa <- (EFA.Comp.Data(Data=lismember, F.Max=9, Graph=T))
})
I think this dummy example is an analogue, so in other words:
nbs <- list(1==1,2==2,3==3,4==4)
nbsout <- lapply(nbs,function(x) length(x))
Gives me something I can access, while I can't see how to store output using the below (e.g. the attempt to use nbsout[[x]][2]):
nbs <- list(1==1,2==2,3==3,4==4)
nbsout <- lapply(nbs,function(x){
nbsout[[x]][1]<-typeof(x)
nbsout[[x]][2]<-length(x)
}
)
I'm using RStudio and will then be printing outputs/knitting html (where it makes sense to display the results from each dataset together, rather than each function-output for each dataset sequentially).
You should return a structure that includes all your outputs. It is better to return a named list. You can also return a data.frame if your outputs all have the same dimensions.
output <- lapply(dflis,function(lismember){
outputvss <- vss(lismember,n=9,rotate="oblimin",diagonal=F,fm="ml")
nefa <- (EFA.Comp.Data(Data=lismember, F.Max=9, Graph=T))
list(outputvss=outputvss,nefa=nefa)
## or data.frame(outputvss=outputvss,nefa=nefa)
})
When each call returns a data.frame, you can combine the results into one big data.frame. sapply will try to simplify the result, but it does not reliably produce a data.frame, so the safer choice is the classical:
do.call(rbind,output)
to aggregate the result.
A function should always have an explicit return value, e.g.
output <- lapply(dflis,function(lismember){
outputvss <- vss(lismember,n=9,rotate="oblimin",diagonal=F,fm="ml")
nefa <- (EFA.Comp.Data(Data=lismember, F.Max=9, Graph=T))
#return value:
list(outputvss, nefa)
})
output is then a list of lists.
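To show how such a list of lists can be used afterwards, here is a small sketch with made-up data frames and two trivial summary functions standing in for vss and EFA.Comp.Data. The names n, avg and summaryTable are assumptions chosen for the example.

```r
# Toy stand-ins for the list of data frames (values are made up)
dflis <- list(first = data.frame(x = c(1, 2, 3)),
              second = data.frame(x = c(10, 20, 30, 40)))

# Each call returns a named list holding both results
output <- lapply(dflis, function(lismember) {
  list(n = nrow(lismember), avg = mean(lismember$x))
})

# Results for one dataset, kept together
output$second$avg   # or output[["second"]][["avg"]]

# One result across all datasets
avgs <- sapply(output, function(res) res$avg)

# If every element has the same shape, bind them into one data frame
summaryTable <- do.call(rbind, lapply(output, as.data.frame))
```

Naming the elements of the inner list is what makes the output$second$avg style of access possible. With an unnamed list, the same value would be output$second[[2]].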
I am using the extract function in a loop. See below.
for (i in 1:length(list_shp_Tanzania)){
LU_Mod2000<- extract(x=rc_Mod2000_LC, y=list_shp_Tanzania[[i]], fun=maj)
}
Where maj function is:
maj <- function(x){
y <- as.numeric(names(which.max(table(x))))
return(y)
}
I was expecting to get i outputs, but I get only one output once the loop is done. Does anybody know what I am doing wrong? Thanks.
One solution in this kind of situation is to create a list and then assign the result of each iteration to the corresponding element of the list:
LU_Mod2000 <- vector("list", length(list_shp_Tanzania))
for (i in 1:length(list_shp_Tanzania)){
LU_Mod2000[[i]] <- extract(x=rc_Mod2000_LC, y=list_shp_Tanzania[[i]], fun=maj)
}
Do not do
LU_Mod2000 <- c(LU_Mod2000, extract(x=rc_Mod2000_LC, y=list_shp_Tanzania[[i]], fun=maj))
inside the loop. This will create unnecessary copies and will take a long time to run. Use the list method and, after the loop, convert the list of results to the desired format (usually using do.call(<some function>, LU_Mod2000)).
Alternatively, you could substitute the for loop with lapply, which is what many people seem to prefer:
LU_Mod2000 <- lapply(list_shp_Tanzania, function(z) extract(x=rc_Mod2000_LC, y=z, fun=maj))
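The pre-allocate, fill and combine pattern can be tried out without any raster data. In the sketch below, list_shp is a hypothetical stand-in for the list of shapefiles (each element is just a vector of made-up class codes), and maj is applied directly instead of going through extract.

```r
# Hypothetical stand-in for the list of shapefiles: each element is a vector of class codes
list_shp <- list(c(1, 1, 2), c(3, 3, 3, 4), c(5, 6, 6))

maj <- function(x){
  y <- as.numeric(names(which.max(table(x))))
  return(y)
}

# Pre-allocate, fill by index, then combine once after the loop
results <- vector("list", length(list_shp))
for (i in seq_along(list_shp)) {
  results[[i]] <- maj(list_shp[[i]])
}
combined <- do.call(c, results)   # function first, list of arguments second

# vapply does the same in one step and checks each result is a single number
combined2 <- vapply(list_shp, maj, numeric(1))
```

Here do.call(c, results) concatenates the per-polygon values into one numeric vector. With results that are rows or data frames, rbind would be the usual function to pass instead.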
So, I built a function called sort.song.
My goal with this function is to randomly sample the rows of a data.frame (DATA) and then filter it (DATA.NEW) to analyse it. I want to do this multiple times (let's say 10 times). In the end, I want each object (mantel.something) produced by this function to be saved in my workspace with a name that I can relate to each cycle (mantel.something1, mantel.something2...mantel.something10).
I have the following code, so far:
sort.song<-function(DATA){
require(ade4)
for(i in 1:10){ # Am I using for correctly here?
DATA.NEW <- DATA[sample(1:nrow(DATA),replace=FALSE),]
DATA.NEW <- DATA.NEW[!duplicated(DATA.NEW$Point),]
coord.dist<-dist(DATA.NEW[,4:5],method="euclidean")
num.notes.dist<-dist(DATA.NEW$Num_Notes,method="euclidean")
songdur.dist<-dist(DATA.NEW$Song_Dur,method="euclidean")
hfreq.dist<-dist(DATA.NEW$High_Freq,method="euclidean")
lfreq.dist<-dist(DATA.NEW$Low_Freq,method="euclidean")
bwidth.dist<-dist(DATA.NEW$Bwidth_Song,method="euclidean")
hfreqlnote.dist<-dist(DATA.NEW$HighFreq_LastNote,method="euclidean")
mantel.numnotes[i]<<-mantel.rtest(coord.dist,num.notes.dist,nrepet=1000)
mantel.songdur[i]<<-mantel.rtest(coord.dist,songdur.dist,nrepet=1000)
mantel.hfreq[i]<<-mantel.rtest(coord.dist,hfreq.dist,nrepet=1000)
mantel.lfreq[i]<<-mantel.rtest(coord.dist,lfreq.dist,nrepet=1000)
mantel.bwidth[i]<<-mantel.rtest(coord.dist,bwidth.dist,nrepet=1000)
mantel.hfreqlnote[i]<<-mantel.rtest(coord.dist,hfreqlnote.dist,nrepet=1000)
}
}
Could someone please help me to do it the right way?
I think I'm not assigning the cycles correctly for each mantel.something object.
Many thanks in advance!
The best way to implement what you are trying to do is through a list. You can even make it take two indices, the first for the iterations, the second for the type of analysis.
mantellist <- as.list(1:10) ## initiate list with some values
for (i in 1:10){
...
mantellist[[i]] <- list(numnotes=mantel.rtest(coord.dist,num.notes.dist,nrepet=1000),
songdur=mantel.rtest(coord.dist,songdur.dist,nrepet=1000),
hfreq=mantel.rtest(coord.dist,hfreq.dist,nrepet=1000),
...)
}
return(mantellist)
In this way you can index your specific analysis for each iteration in an intuitive way:
mantellist[[2]][['hfreq']]
mantellist[[2]]$hfreq ## alternative
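The same structure also makes it easy to pull one analysis out of every iteration at once. The sketch below uses made-up numbers in place of the mantel.rtest results, so only the indexing pattern is meant to carry over. The names hfreq.all and hfreq.vec are assumptions.

```r
# Toy stand-in: each "iteration" stores made-up numbers instead of mantel.rtest results
mantellist <- vector("list", 3)
for (i in 1:3) {
  mantellist[[i]] <- list(numnotes = i * 0.1, songdur = i * 0.2, hfreq = i * 0.3)
}

# Pull the hfreq result out of every iteration (always returns a list)
hfreq.all <- lapply(mantellist, `[[`, "hfreq")

# With simple numeric results, sapply collapses them to a vector
hfreq.vec <- sapply(mantellist, function(res) res$hfreq)
```

Real mantel.rtest results are more complex objects than single numbers, so the lapply form, which always returns a list, is probably the safer starting point there.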
EDIT by Mohr:
Just for clarification...
So, according to your suggestion the code should be something like this:
sort.song<-function(DATA){
require(ade4)
mantellist <- as.list(1:10)
for(i in 1:10){
DATA.NEW <- DATA[sample(1:nrow(DATA),replace=FALSE),]
DATA.NEW <- DATA.NEW[!duplicated(DATA.NEW$Point),]
coord.dist<-dist(DATA.NEW[,4:5],method="euclidean")
num.notes.dist<-dist(DATA.NEW$Num_Notes,method="euclidean")
songdur.dist<-dist(DATA.NEW$Song_Dur,method="euclidean")
hfreq.dist<-dist(DATA.NEW$High_Freq,method="euclidean")
lfreq.dist<-dist(DATA.NEW$Low_Freq,method="euclidean")
bwidth.dist<-dist(DATA.NEW$Bwidth_Song,method="euclidean")
hfreqlnote.dist<-dist(DATA.NEW$HighFreq_LastNote,method="euclidean")
mantellist[[i]] <- list(numnotes=mantel.rtest(coord.dist,num.notes.dist,nrepet=1000),
songdur=mantel.rtest(coord.dist,songdur.dist,nrepet=1000),
hfreq=mantel.rtest(coord.dist,hfreq.dist,nrepet=1000),
lfreq=mantel.rtest(coord.dist,lfreq.dist,nrepet=1000),
bwidth=mantel.rtest(coord.dist,bwidth.dist,nrepet=1000),
hfreqlnote=mantel.rtest(coord.dist,hfreqlnote.dist,nrepet=1000)
)
}
return(mantellist)
}
You can achieve your objective of repeating this exercise 10 (or more) times without using an explicit for-loop. Rather than have the function run the loop, write the sort.song function to run one iteration of the process, then use replicate to repeat that process however many times you desire.
It is generally good practice not to create a bunch of named objects in your global environment. Instead, you can hold the results of each iteration of this process in a single object. replicate will simplify the result to an array where possible and otherwise return a list; with simplify=FALSE it always returns a list (here, a list of lists). So, the list will have 10 elements (one for each iteration) and each element will itself be a list containing named elements corresponding to each result of mantel.rtest.
sort.song<-function(DATA){
DATA.NEW <- DATA[sample(1:nrow(DATA),replace=FALSE),]
DATA.NEW <- DATA.NEW[!duplicated(DATA.NEW$Point),]
coord.dist <- dist(DATA.NEW[,4:5],method="euclidean")
num.notes.dist <- dist(DATA.NEW$Num_Notes,method="euclidean")
songdur.dist <- dist(DATA.NEW$Song_Dur,method="euclidean")
hfreq.dist <- dist(DATA.NEW$High_Freq,method="euclidean")
lfreq.dist <- dist(DATA.NEW$Low_Freq,method="euclidean")
bwidth.dist <- dist(DATA.NEW$Bwidth_Song,method="euclidean")
hfreqlnote.dist <- dist(DATA.NEW$HighFreq_LastNote,method="euclidean")
return(list(
numnotes = mantel.rtest(coord.dist,num.notes.dist,nrepet=1000),
songdur = mantel.rtest(coord.dist,songdur.dist,nrepet=1000),
hfreq = mantel.rtest(coord.dist,hfreq.dist,nrepet=1000),
lfreq = mantel.rtest(coord.dist,lfreq.dist,nrepet=1000),
bwidth = mantel.rtest(coord.dist,bwidth.dist,nrepet=1000),
hfreqlnote = mantel.rtest(coord.dist,hfreqlnote.dist,nrepet=1000)
))
}
require(ade4)
mantel.results <- replicate(10, sort.song(DATA), simplify=FALSE) # simplify=FALSE keeps one list element per iteration
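One detail worth checking is how replicate shapes its result when the function returns a list. The sketch below uses a hypothetical stand-in, one.run, which returns a named list of two random numbers instead of mantel.rtest results. With the default simplification, equal-length lists are arranged into a matrix of list cells, whereas simplify = FALSE gives one list element per iteration.

```r
# Hypothetical stand-in for sort.song: returns a named list of two random numbers
one.run <- function() {
  list(a = runif(1), b = runif(1))
}

set.seed(1)
simplified <- replicate(5, one.run())                   # default: simplifies where it can
kept.list  <- replicate(5, one.run(), simplify = FALSE) # always a plain list

dim(simplified)        # a 2 x 5 matrix whose cells are list elements
length(kept.list)      # 5, one element per iteration
names(kept.list[[1]])  # "a" "b"
```

With the list form, kept.list[[3]]$a reads naturally as "result a from iteration 3", which matches the indexing style used earlier in this thread.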