First let me say that I am not an expert coder, and any advice about this particular question or my general technique will be greatly appreciated.
I have a large data set that is made up of similar data frames named Table6.#, such as Table6.1, Table6.2, etc. Each data frame also has variables that repeat, such as ST1_Delta_PV%, ST2_Delta_PV%, etc. and ST1_Reallocation_Margin, ST2_Reallocation_Margin, etc.
I am trying to write several nested loops that will calculate values in each table across these similar variables. I have tried to do this with the paste function as shown below, but this is obviously not the correct way to do this.
for (i in 1:25){
for (j in 1:4){
for (k in 1:length(paste("Table6.",i,"sep="")[,1]){
paste("Table6.",i,sep="")$paste("ST",j,"NonTgt_Shr",sep="")[k] <- paste("Table6.",i,sep="")$paste("ST",j,"_Delta_PV%",sep="")[k] * paste("Table6.",i,sep="")$paste("ST",j,"_Reallocation_Margin",sep="")[k]
}
}
}
I apologize if this is a complete mess. I appreciate your help.
As akrun says, you should put your data frames in a list
Tables <- list(Table6.1, Table6.2, …)
for (Table in Tables) { … }
This way, you do not need to use paste to construct the different Table names.
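If the tables already exist as separate objects, one way to collect them without typing every name is mget, which looks objects up by name and returns a named list. A minimal sketch, assuming two toy data frames stand in for the real Table6.1 and Table6.2 (the values are made up):

```r
# Toy stand-ins for the real tables (made-up values)
Table6.1 <- data.frame(a = 1:3)
Table6.2 <- data.frame(a = 4:6)

# mget() looks up several objects by name and returns them as a named list
Tables <- mget(paste0("Table6.", 1:2))

# Iterate by name so each table can be reported on or written back into the list
for (name in names(Tables)) {
  cat(name, "has", nrow(Tables[[name]]), "rows\n")
}
```

Keeping the names on the list makes it easy to tell afterwards which result belongs to which table.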
To access the different columns, you can use the df[["column"]] syntax - this is similar to df$column, except that inside the brackets you can use any string (df["column"] with single brackets returns a one-column data frame rather than the column itself)
nonTgt_Shr.column.name <- paste0("ST",j,"NonTgt_Shr")
delta.column.name <- paste0("ST",j,"_Delta_PV%")
for (k in 1:nrow(Table)) {
Table[[nonTgt_Shr.column.name]][k] <- Table[[delta.column.name]][k] * …
}
Note how I use variables for storing the name, making the line with the actual computation much more readable.
Also, nrow is more intuitive than length(Table[,1]).
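Because R arithmetic is vectorised, the loop over rows can usually be dropped and the whole column computed in one step. The sketch below assumes the tables live in a named list and uses a made-up toy table with the column layout from the question. The result is assigned back into the list element, since changing only the loop variable would not update the stored table.

```r
# Toy table with the column layout described in the question (made-up values)
Tables <- list(
  Table6.1 = data.frame(
    "ST1_Delta_PV%" = c(0.1, 0.2), "ST1_Reallocation_Margin" = c(10, 20),
    "ST2_Delta_PV%" = c(0.3, 0.4), "ST2_Reallocation_Margin" = c(30, 40),
    check.names = FALSE  # keep the "%" in the column names
  )
)

for (name in names(Tables)) {
  for (j in 1:2) {
    out     <- paste0("ST", j, "NonTgt_Shr")
    delta   <- paste0("ST", j, "_Delta_PV%")
    realloc <- paste0("ST", j, "_Reallocation_Margin")
    # Whole-column multiplication, written into the list element so it persists
    Tables[[name]][[out]] <- Tables[[name]][[delta]] * Tables[[name]][[realloc]]
  }
}
```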
The calculations could be transformed into a function which improves readability, scaling and robustness.
In the actual calculation function, the function get is used to retrieve the data frame based on the name.
#Calculation Function
fn_CalcVariables <- function(
tableName="Table6.1",
outputVarName="NonTgt_Shr",
inputVarNames=c("_Delta_PV%", "_Reallocation_Margin"),
variablePrefix="ST1"
) {
DF <- get(tableName)
outputVarName <- paste0(variablePrefix, outputVarName)
inputVarNames <- paste0(variablePrefix, inputVarNames)
DF[,outputVarName] <- DF[,inputVarNames[1]] * DF[,inputVarNames[2]]
return(DF)
}
This function should be called by nested lapply calls.
lapply iterates over the elements of its first argument, calls the function (the second argument) on each one, and collects the return values in a list.
(As an exercise, try l <- list(a=1, b=2); lapply(l, function(x) { x*2 }).)
#List object names for tables and variable names
tableNamesList <- paste0("Table6.",1:25)
variablePrefixList <- paste0("ST",1:4)
#Nested loops to invoke custom function from above
results <- lapply(variablePrefixList, function(alpha) {
lapply(tableNamesList, function(x, varprefix=alpha) {
cat("Begin Processing Table",x,"varPrefix",varprefix,"\n")
DF <- fn_CalcVariables(
tableName=x,
outputVarName="NonTgt_Shr",
inputVarNames=c("_Delta_PV%","_Reallocation_Margin"),
variablePrefix=varprefix
)
cat("End Processing Table", x, "varPrefix", varprefix, "\n")
DF #Return the updated data frame; otherwise lapply would collect the NULL returned by cat
}) #End of inner lapply
}) #End of outer lapply
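One thing to be aware of with this approach: every call fetches the original table with get, so the new columns for ST1, ST2, etc. end up in separate copies of each table. If the goal is one updated copy per table with all the new columns, a possible variation is to keep the tables in a list and fold the prefixes over each table with Reduce. This is only a sketch; add_share and the toy data are hypothetical, not part of the answer above.

```r
# Hypothetical helper: add the <prefix>NonTgt_Shr column to one data frame
add_share <- function(df, prefix) {
  df[[paste0(prefix, "NonTgt_Shr")]] <-
    df[[paste0(prefix, "_Delta_PV%")]] * df[[paste0(prefix, "_Reallocation_Margin")]]
  df
}

# Toy table (made-up values); check.names = FALSE keeps the "%" in the names
tables <- list(
  Table6.1 = data.frame(
    "ST1_Delta_PV%" = c(0.1, 0.2), "ST1_Reallocation_Margin" = c(10, 20),
    "ST2_Delta_PV%" = c(0.5, 0.5), "ST2_Reallocation_Margin" = c(2, 4),
    check.names = FALSE
  )
)
prefixes <- paste0("ST", 1:2)

# For each table, apply add_share once per prefix, carrying the updated table forward
tables <- lapply(tables, function(df) Reduce(add_share, prefixes, df))
```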
I've got a list of strings list = c("string_1", "string_2", ...) and, based on this vector, I know there are dataframes named df_string_1, df_string_2, ... in my environment.
My code is like:
for (i in 1:length(list)){
res = function(i) // dataframe depending on i, same ncol as df_string_i
rbind(df_list[i],res) // that's the line I don't know how to code
}
I can't find a way to get the dataframe df_string_i at each iteration.
My attempt was to get its name with paste("df_",list[i],sep="") but then, what can I do with this string as I need the variable to be in the rbind?
Thanks for your help!
If you have a situation with data.frames named df_string_1, ..., df_string_n, you would be better off, in the long run, storing these in a list and using tools such as lapply. To solve the current issue, use get:
for (i in seq_along(list)){
res = f(i) # f is a placeholder for whatever builds the dataframe depending on i, same ncol as df_string_i
combined = rbind(get(paste0("df_",list[i])),res) # keep the result, otherwise it is discarded
}
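For the longer-run approach, here is a rough sketch of what the list version could look like. The two toy data frames, the vector name keys, and the helper make_res are all made up for illustration; make_res stands in for whatever produces the extra rows.

```r
# Toy stand-ins for df_string_1 and df_string_2 (made-up values)
df_string_1 <- data.frame(x = 1:2)
df_string_2 <- data.frame(x = 3:4)
keys <- c("string_1", "string_2")

# Collect the existing data frames into one list, looked up by name
dfs <- mget(paste0("df_", keys))

# Hypothetical stand-in for the function that builds the rows to append
make_res <- function(i) data.frame(x = i * 100)

# Append the new rows to each data frame and keep every result
combined <- lapply(seq_along(dfs), function(i) rbind(dfs[[i]], make_res(i)))
names(combined) <- keys
```

After this, combined[["string_1"]] holds the first data frame with its extra row, and no further get or paste calls are needed.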
I am new to R, so I'm sorry if it is not a good question.
I have several data frames called matrix1, matrix2, etc.
I want to use these 2 commands in a loop for all of them:
A1=as.matrix(matrix1)
B1=graph.adjacency(A1,mode="directed",weighted=NULL,diag=FALSE)
but I cannot figure out how to get the loop to change the names of the matrices.
Thank you in advance!
You can use get to get a variable by its name.
e.g.
for (i in 1:n) {
A1 = as.matrix(get(paste0('matrix', i)))
B1 = graph.adjacency(A1,mode="directed",weighted=NULL,diag=FALSE)
}
If you want to store the B1s, you could do so using (for example) a list:
Bs <- lapply(1:n, function (i) {
A1 = ...
B1 = ...
return(B1)
})
Then Bs[[i]] will contain the B1 of matrix i.
And then, a further improvement - rather than manually naming all your matrices matrix1, matrix2, ... , matrix10000 (particularly if you have a lot of them!), it would be better to store them in a list, e.g. As[[i]] is matrixi. (I can't give you specific code on how to do this, as it depends on where your matrices come from/how they are populated. e.g. you might lapply(list_of_filenames, read.csv) to read all the matrices from a list of file names).
Then you can:
Bs <- lapply(As, graph.adjacency, mode="directed", weighted=NULL, diag=FALSE)
without resorting to get.
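If the matrices already exist as matrix1, matrix2, ... in the workspace, one possible way to move them into a list is mget. The sketch below uses made-up data, and to stay free of the igraph dependency it applies a hypothetical placeholder, count_edges, where graph.adjacency would go.

```r
# Toy stand-ins for matrix1 and matrix2 (made-up adjacency data)
matrix1 <- data.frame(a = c(0, 1), b = c(1, 0))
matrix2 <- data.frame(a = c(0, 0), b = c(1, 0))
n <- 2

# Gather the objects by name, then convert each one to a matrix
As <- lapply(mget(paste0("matrix", seq_len(n))), as.matrix)

# Hypothetical stand-in for graph.adjacency: count the non-zero entries
count_edges <- function(A) sum(A != 0)

# One result per matrix, named after the original object
Bs <- lapply(As, count_edges)
```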
Use assign() to create matrices/data.frames in loops. Use get() when calling a numbered matrix/data.frame in your loop.
for (i in 1:n) {
assign(paste0("A", i), unname(as.matrix(get(paste0("matrix", i)))))
assign(paste0("B", i), graph.adjacency(get(paste0("A", i)),
mode = "directed",
weighted = NULL,
diag = FALSE))
}
So, I built a function called sort.song.
My goal with this function is to randomly sample the rows of a data.frame (DATA) and then filter it (DATA.NEW) to analyse it. I want to do it multiple times (let's say 10 times). By the end, I want each object (mantel.something) resulting from this function to be saved in my workspace with a name that I can relate to each cycle (mantel.something1, mantel.something2...mantel.something10).
I have the following code, so far:
sort.song<-function(DATA){
require(ade4)
for(i in 1:10){ # Am I using for correctly here?
DATA.NEW <- DATA[sample(1:nrow(DATA),replace=FALSE),]
DATA.NEW <- DATA.NEW[!duplicated(DATA.NEW$Point),]
coord.dist<-dist(DATA.NEW[,4:5],method="euclidean")
num.notes.dist<-dist(DATA.NEW$Num_Notes,method="euclidean")
songdur.dist<-dist(DATA.NEW$Song_Dur,method="euclidean")
hfreq.dist<-dist(DATA.NEW$High_Freq,method="euclidean")
lfreq.dist<-dist(DATA.NEW$Low_Freq,method="euclidean")
bwidth.dist<-dist(DATA.NEW$Bwidth_Song,method="euclidean")
hfreqlnote.dist<-dist(DATA.NEW$HighFreq_LastNote,method="euclidean")
mantel.numnotes[i]<<-mantel.rtest(coord.dist,num.notes.dist,nrepet=1000)
mantel.songdur[i]<<-mantel.rtest(coord.dist,songdur.dist,nrepet=1000)
mantel.hfreq[i]<<-mantel.rtest(coord.dist,hfreq.dist,nrepet=1000)
mantel.lfreq[i]<<-mantel.rtest(coord.dist,lfreq.dist,nrepet=1000)
mantel.bwidth[i]<<-mantel.rtest(coord.dist,bwidth.dist,nrepet=1000)
mantel.hfreqlnote[i]<<-mantel.rtest(coord.dist,hfreqlnote.dist,nrepet=1000)
}
}
Could someone please help me to do it the right way?
I think I'm not assigning the cycles correctly for each mantel.something object.
Many thanks in advance!
The best way to implement what you are trying to do is through a list. You can even make it take two indices, the first for the iterations, the second for the type of analysis.
mantellist <- as.list(1:10) ## initiate list with some values
for (i in 1:10){
...
mantellist[[i]] <- list(numnotes=mantel.rtest(coord.dist,num.notes.dist,nrepet=1000),
songdur=mantel.rtest(coord.dist,songdur.dist,nrepet=1000),
hfreq=mantel.rtest(coord.dist,hfreq.dist,nrepet=1000),
...)
}
return(mantellist)
In this way you can index your specific analysis for each iteration in an intuitive way:
mantellist[[2]][['hfreq']]
mantellist[[2]]$hfreq ## alternative
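A nested list like this also makes it easy to pull one quantity out of every iteration with sapply. The sketch below uses a made-up stand-in for mantellist and assumes each test result exposes obs and pvalue elements, which is my understanding of what ade4 test objects contain; checking str(mantellist[[1]]$hfreq) on real output would confirm the element names.

```r
# Toy stand-in for mantellist: 3 iterations, 2 analyses each (made-up numbers)
mantellist <- list(
  list(numnotes = list(obs = 0.10, pvalue = 0.30), hfreq = list(obs = 0.40, pvalue = 0.01)),
  list(numnotes = list(obs = 0.12, pvalue = 0.25), hfreq = list(obs = 0.20, pvalue = 0.20)),
  list(numnotes = list(obs = 0.08, pvalue = 0.40), hfreq = list(obs = 0.35, pvalue = 0.03))
)

# One p-value per iteration for the hfreq analysis
hfreq_p <- sapply(mantellist, function(iter) iter[["hfreq"]]$pvalue)

# Share of iterations in which hfreq falls below the 0.05 threshold
mean(hfreq_p < 0.05)
```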
EDIT by Mohr:
Just for clarification...
So, according to your suggestion the code should be something like this:
sort.song<-function(DATA){
require(ade4)
mantellist <- as.list(1:10)
for(i in 1:10){
DATA.NEW <- DATA[sample(1:nrow(DATA),replace=FALSE),]
DATA.NEW <- DATA.NEW[!duplicated(DATA.NEW$Point),]
coord.dist<-dist(DATA.NEW[,4:5],method="euclidean")
num.notes.dist<-dist(DATA.NEW$Num_Notes,method="euclidean")
songdur.dist<-dist(DATA.NEW$Song_Dur,method="euclidean")
hfreq.dist<-dist(DATA.NEW$High_Freq,method="euclidean")
lfreq.dist<-dist(DATA.NEW$Low_Freq,method="euclidean")
bwidth.dist<-dist(DATA.NEW$Bwidth_Song,method="euclidean")
hfreqlnote.dist<-dist(DATA.NEW$HighFreq_LastNote,method="euclidean")
mantellist[[i]] <- list(numnotes=mantel.rtest(coord.dist,num.notes.dist,nrepet=1000),
songdur=mantel.rtest(coord.dist,songdur.dist,nrepet=1000),
hfreq=mantel.rtest(coord.dist,hfreq.dist,nrepet=1000),
lfreq=mantel.rtest(coord.dist,lfreq.dist,nrepet=1000),
bwidth=mantel.rtest(coord.dist,bwidth.dist,nrepet=1000),
hfreqlnote=mantel.rtest(coord.dist,hfreqlnote.dist,nrepet=1000)
)
}
return(mantellist)
}
You can achieve your objective of repeating this exercise 10 (or more) times without using an explicit for-loop. Rather than have the function run the loop, write the sort.song function to run one iteration of the process, then you can use replicate to repeat that process however many times you desire.
It is generally good practice not to create a bunch of named objects in your global environment. Instead, you can hold the results of each iteration of this process in a single object. By default, replicate tries to simplify its result into an array; with simplify = FALSE it always returns a plain list, which is the easier shape to work with here. That list will have 10 elements (one for each iteration), and each element will itself be a list containing named elements corresponding to each result of mantel.rtest.
sort.song<-function(DATA){
DATA.NEW <- DATA[sample(1:nrow(DATA),replace=FALSE),]
DATA.NEW <- DATA.NEW[!duplicated(DATA.NEW$Point),]
coord.dist <- dist(DATA.NEW[,4:5],method="euclidean")
num.notes.dist <- dist(DATA.NEW$Num_Notes,method="euclidean")
songdur.dist <- dist(DATA.NEW$Song_Dur,method="euclidean")
hfreq.dist <- dist(DATA.NEW$High_Freq,method="euclidean")
lfreq.dist <- dist(DATA.NEW$Low_Freq,method="euclidean")
bwidth.dist <- dist(DATA.NEW$Bwidth_Song,method="euclidean")
hfreqlnote.dist <- dist(DATA.NEW$HighFreq_LastNote,method="euclidean")
return(list(
numnotes = mantel.rtest(coord.dist,num.notes.dist,nrepet=1000),
songdur = mantel.rtest(coord.dist,songdur.dist,nrepet=1000),
hfreq = mantel.rtest(coord.dist,hfreq.dist,nrepet=1000),
lfreq = mantel.rtest(coord.dist,lfreq.dist,nrepet=1000),
bwidth = mantel.rtest(coord.dist,bwidth.dist,nrepet=1000),
hfreqlnote = mantel.rtest(coord.dist,hfreqlnote.dist,nrepet=1000)
))
}
require(ade4)
results <- replicate(10, sort.song(DATA), simplify = FALSE) # simplify = FALSE keeps a plain list of lists
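To see what replicate does with a function that returns a list, here is a small sketch with a made-up function, one_run, standing in for sort.song. It compares the default simplification with simplify = FALSE.

```r
# Hypothetical stand-in for one iteration: returns a named list of two results
one_run <- function() list(a = runif(1), b = runif(1))

# Default: results are simplified into a 2 x 3 matrix whose cells are list elements
simplified <- replicate(3, one_run())
dim(simplified)

# simplify = FALSE: a plain list with one element per iteration
as_list <- replicate(3, one_run(), simplify = FALSE)
length(as_list)

# Each iteration keeps its named results
names(as_list[[1]])
```

The plain list form can be indexed as as_list[[2]]$a, which matches the indexing used in the other answer.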
So, my goal is to write a function that will take as input any csv file, an output path, and an arbitrary number of split sizes (by number of rows), and then randomize and split the data into the appropriate files. I could really easily do this manually if I know the split sizes ahead of time, but I want an automated function that will handle varying split sizes. Seems straightforward, and here's what I had written:
randomizer = function(startFile, endPath, ...){ ##where ... are the user-defined split sizes
vec = unlist(list(...))
n_files = length(vec)
values = read.csv(startFile, stringsAsFactors = FALSE)
values_rand = as.data.frame(values[sample(nrow(values)),])
for(i in 1:n_files){
if(nrow(values_rand)!=0 & !is.null(nrow(values_rand))){
assign(paste('group', i , sep=''), values_rand[1:vec[i], ]);
values_rand = as.data.frame(values_rand[(vec[i]+1):nrow(values_rand), ], stringsAsFactors = FALSE)
## (A) write.csv fn here?
} else {
print("something went wrong")
}
}
## (B) write.csv fn here?
}
}
When I try to do something in place (A) like write.csv(x = paste('group', i, sep=''), file = paste(endPath, '/group', i, '.csv', sep=''), row.names=FALSE), I either get errors or literally write the string "group1" to a csv, rather than the chunk of the randomized data frame I'm looking for. I'm super confused, as this seems like I'm running up against R semantics rather than a genuine programming issue. Thanks in advance for the help.
You have indeed programmed yourself into a corner here, and it's a common one for beginners to end up in, particularly beginners that are coming to R from other programming languages.
The use of assign is the big red flag. At least when you're starting out in the language, if you feel yourself reaching for that function, stop and think again. You're most likely approaching the problem entirely wrong and need to rethink it.
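As a small illustration of the alternative, the pieces can go into a list indexed by i instead of into variables named group1, group2, and so on. This is only a sketch with made-up data and sizes. It also shows why passing paste('group', i) to write.csv writes text: paste returns a character string, not the object that happens to have that name.

```r
# Made-up data and split sizes
values_rand <- data.frame(id = 1:6)
sizes <- c(2, 4)

# Row ranges for each group: 1-2 and 3-6
ends <- cumsum(sizes)
starts <- ends - sizes + 1

# Store each chunk in a list slot rather than creating group1, group2, ...
groups <- vector("list", length(sizes))
for (i in seq_along(sizes)) {
  groups[[i]] <- values_rand[starts[i]:ends[i], , drop = FALSE]
}

# paste0() only builds a name; the result is plain text, not a data frame
is.character(paste0("group", 1))
is.data.frame(groups[[1]])
```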
Here is my (entirely untested) version of what you described, annotated with some comments:
split_file <- function(startFile,endPath,sizes){
#There's no need to use "..." for the partition sizes.
# A simple vector of values is much simpler
values <- read.csv(startFile,stringsAsFactors = FALSE)
if (sum(sizes) != nrow(values)){
#I'm assuming here that we're not doing anything fancy with bad input
stop("sizes do not evenly partition data!")
}else{
#Shuffle data frame
# Note we don't need as.data.frame()
values <- values[sample(nrow(values)),]
#Split data frame
values <- split(values,rep(seq_along(sizes),times = sizes)) #one group label per size, repeated size times
#Create the output file paths
paths <- paste0(endPath,"/group_",seq_along(sizes),".csv")
#We could shoe-horn this into lapply, but there's no real need
for (i in seq_along(values)){
write.csv(x = values[[i]],file = paths[i],row.names = FALSE)
}
}
}
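The shuffle, split, and write steps can be tried out on a small made-up data frame without any input file. The sketch below is not the function above, just the same idea run end to end; the data, the sizes, and the temporary output directory are all assumptions for illustration.

```r
# Made-up data and split sizes (the sizes must add up to the number of rows)
values <- data.frame(id = 1:10, x = letters[1:10])
sizes <- c(3, 5, 2)
stopifnot(sum(sizes) == nrow(values))

# Shuffle the rows, then label them 1,1,1,2,2,2,2,2,3,3 and split on the label
shuffled <- values[sample(nrow(values)), ]
groups <- split(shuffled, rep(seq_along(sizes), times = sizes))

# Write each group to its own csv in a fresh temporary directory
out_dir <- tempfile("groups_")
dir.create(out_dir)
paths <- file.path(out_dir, paste0("group_", seq_along(groups), ".csv"))
for (i in seq_along(groups)) {
  write.csv(groups[[i]], file = paths[i], row.names = FALSE)
}
```

Reading one of the files back with read.csv(paths[1]) is a quick way to check that the expected number of rows was written.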