R for loop to summarize matrix of data - r

I'm new to R (about 2 days of use) and coming from MATLAB, and the syntax nuances are driving me a little crazy. If anyone can point me in the right direction on this topic I would really appreciate it. I have a dataset (fl1.back) with 32 variables (columns) and 513 measurements (rows), and I want to create a table of basic summary statistics for 9 of the 32 columns. There's a separate dataset (fl2.back) from which I would also like to pull 1 column of data for the final table.
Here's the code I used to do the above tasks for 1 of the columns of data (sodium measurements) from fl1.back and fl2.back:
fl1.back <- read.delim("web.flat",comment.char="#",colClasses="character")
fl1.back <- fl1.back[-1,]
fl2.back <- read.delim("web.flat2",comment.char="#",colClasses="character")
fl2.back <- fl2.back[-1,]
head(fl1.back)
head(fl2.back)
#for rep criteria for sodium
back.sod.rep <- fl2.back[fl2.back$P00930!="",]
back.sod.rep$P00930 <- as.numeric(back.sod.rep$P00930)
back.sod.rep$P00930
#for samples...sodium
back.sod <- fl1.back[fl1.back$P00930!="",]
back.sod$P00930 <- as.numeric(back.sod$P00930)
back.sod$P00930
head(back.sod)
back.sod.summ <- data.frame("Sodium")
back.sod.summ
colnames(back.sod.summ) <- "Compound"
back.sod.summ$WQ_crit <- "20 mg/L"
back.sod.summ$n <- nrow(back.sod)
back.sod.summ$n_det <- nrow(back.sod[back.sod$R00930!="<",])
back.sod.summ$min <- min(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$max <- max(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$mean <- mean(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$median <- median(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$percent_samp_det <- 100*(back.sod.summ$n_det/back.sod.summ$n)
back.sod.summ$percent_samp_above_crit <- 100*(length(back.sod[back.sod$P00930>20,"P00930"])/back.sod.summ$n)
back.sod.summ$percent_rep_above_crit <- 100*(sum(back.sod.rep$P00930>=20)/nrow(back.sod.rep))
back.sod$P00930
length(back.sod[back.sod$P00930>back.sod.summ$WQ_crit,"P00930"])
back.sod.summ
final <- data.frame(back.sod.summ)
Instead of rewriting/copying and pasting the above code to create the data frame final, I would like to loop over the two datasets since I'm looking to repeat the same task, just on different columns of data. I really don't know where to start, and there doesn't seem to be much literature on for loops in R.
Any insight is appreciated!

Here is an example of what I think you want with the iris dataset:
library(plyr)
dlply(iris, .(Species), summary)
This can be extended if you need additional stats. Anyway, you probably should use (as I show above) the "split-apply-combine" approach as implemented in various functions and packages.
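To connect this back to the question's own columns, here is a minimal sketch of the same idea in base R: wrap the per-column steps in a function, then apply it over a small lookup table of parameters. Only P00930/R00930 come from the question; the mock data, the second parameter (P00925/R00925), the helper name summarise_param, and the numeric criteria are assumptions made for illustration. The replicate dataset (fl2.back) is left out to keep it short.

```r
# Mock data standing in for fl1.back. Only P00930/R00930 appear in the
# question; P00925/R00925 is a hypothetical second parameter.
fl1 <- data.frame(
  P00930 = c("12", "25", "", "30", "8"),
  R00930 = c("", "", "", "", "<"),
  P00925 = c("1.5", "", "2.5", "4", "0.5"),
  R00925 = c("", "", "<", "", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one summary row for one parameter code.
summarise_param <- function(dat, code, compound, crit) {
  p <- paste0("P", code)          # value column
  r <- paste0("R", code)          # remark column ("<" marks a non-detect)
  dat <- dat[dat[[p]] != "", ]    # drop rows without a measurement
  value <- as.numeric(dat[[p]])
  det <- value[dat[[r]] != "<"]   # detected values only
  data.frame(
    Compound = compound,
    WQ_crit = crit,               # numeric here, not "20 mg/L"
    n = length(value),
    n_det = length(det),
    min = min(det),
    max = max(det),
    mean = mean(det),
    median = median(det),
    percent_samp_det = 100 * length(det) / length(value),
    percent_samp_above_crit = 100 * sum(value > crit) / length(value),
    stringsAsFactors = FALSE
  )
}

# Lookup table of parameters to summarise (second row is made up).
params <- data.frame(
  code = c("00930", "00925"),
  compound = c("Sodium", "Hypothetical"),
  crit = c(20, 3),
  stringsAsFactors = FALSE
)

rows <- lapply(seq_len(nrow(params)), function(i) {
  summarise_param(fl1, params$code[i], params$compound[i], params$crit[i])
})
final <- do.call(rbind, rows)
final
```

Storing the criterion as a number rather than a string such as "20 mg/L" lets it be used directly in comparisons; the unit can be kept in a separate column if needed.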

Related

Saving code by avoiding writing consecutive numbers in R

Suppose we have these objects:
distM_ref1_matrix <- dist(cbind(distM_ref1$x, distM_ref1$y))
distM_ref2_matrix <- dist(cbind(distM_ref2$x, distM_ref2$y))
distM_ref3_matrix <- dist(cbind(distM_ref3$x, distM_ref3$y))
distM_ref4_matrix <- dist(cbind(distM_ref4$x, distM_ref4$y))
distM_ref5_matrix <- dist(cbind(distM_ref5$x, distM_ref5$y))
distM_ref6_matrix <- dist(cbind(distM_ref6$x, distM_ref6$y))
distM_ref7_matrix <- dist(cbind(distM_ref7$x, distM_ref7$y))
distM_ref8_matrix <- dist(cbind(distM_ref8$x, distM_ref8$y))
distM_ref9_matrix <- dist(cbind(distM_ref9$x, distM_ref9$y))
I have two questions:
How can I save code by creating all these objects automatically with a single line? Assume I have many equivalent objects, not only 9 as in the example.
How can I calculate the mean value of all these objects at the same time?
I think that the answer to the second question would be something like this code:
mean(c(distM_ref[1:9]_matrix))
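One possible approach, sketched under assumptions: collect the numbered objects into a list with mget() and work on the list with lapply(). The mock data frames below are invented so the example runs on its own; only the object names follow the question.

```r
# Mock inputs: nine small data frames with x and y columns,
# named as in the question (the values are random placeholders).
set.seed(42)
for (i in 1:9) {
  assign(paste0("distM_ref", i), data.frame(x = runif(5), y = runif(5)))
}

# Collect the existing objects into a named list.
refs <- mget(paste0("distM_ref", 1:9))

# Compute one distance object per list element.
dist_list <- lapply(refs, function(d) dist(cbind(d$x, d$y)))

# Mean of each distance object, and the mean over all distances pooled.
per_object_mean <- sapply(dist_list, mean)
overall_mean <- mean(unlist(dist_list))
```

If the data can be read into a list from the start, the assign()/mget() step is not needed at all, and the list can hold any number of elements.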

Adding a line of data of different types to a row of a data frame

I am experimenting with different regression models. My end goal is to have a nice easy to read dataframe with 3 columns:
model_results <- data.frame(name = character(),
rmse = numeric(),
r2 = numeric())
Then, after running each model, I want to add the corresponding output to the dataframe and, at the end, review it and decide which model to use.
I tried this:
mod.spend_transactions.results <- list("mod.spend_transactions",
rsme(residuals(mod.spend_transactions)),
summary(mod.spend_transactions)$r.squared)
I tried using a list because I know vectors can only store one datatype (right?).
Output:
rbind(model_results, mod.spend_transactions.results)
X.mod.spend_transactions. X12.6029444519635 X0.912505643567096
1 mod.spend_transactions 12.60294 0.9125056
Close, but not what I wanted, since the df names have been changed, which I did not expect.
So I tried vectors, which works but seems "clunky" in that I'm sure I could do this with writing less code:
vect_modname <- vector()
vect_rsme <- vector()
vect_r2 <- vector()
Then after running a model
vect_modname <- c(vect_modname, "mod.spend_transactions")
vect_rsme <- c(vect_rsme, rsme(residuals(mod.spend_transactions)))
vect_r2 <- c(vect_r2, summary(mod.spend_transactions)$r.squared)
Then at the end of running all the models I'm testing out
data.frame(vect_modname, vect_rsme, vect_r2)
Again, the vector method does work. But is there a "better", more elegant way of doing this?
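One way this might be done with less code, as a sketch: keep the fitted models in a named list, build a one-row data frame per model with the column names set explicitly, and bind the rows once at the end. The models on mtcars and the rmse() helper below are assumptions for illustration (the question's own rsme() function is not shown).

```r
# Hypothetical helper: root mean squared error from a residual vector.
rmse <- function(res) sqrt(mean(res^2))

# Mock models fitted on the built-in mtcars data.
models <- list(
  mod.wt = lm(mpg ~ wt, data = mtcars),
  mod.wt_hp = lm(mpg ~ wt + hp, data = mtcars)
)

# One single-row data frame per model. The column names are set
# explicitly, so rbind() has nothing to rename.
result_rows <- lapply(names(models), function(nm) {
  m <- models[[nm]]
  data.frame(name = nm,
             rmse = rmse(residuals(m)),
             r2 = summary(m)$r.squared,
             stringsAsFactors = FALSE)
})
model_results <- do.call(rbind, result_rows)
model_results
```

Because each row is already a data frame with the right column names and types, mixed character and numeric values are not a problem.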

Including additional data at each stage of a loop

I am trying to create minimum convex polygons (MCPs) for a set of GPS coordinates. Each day has 32 coordinates, and I want to create an MCP with 1 day, 2 days, 3 days, and so on worth of data. For instance, in the first step I want to include rows 1-32, which I have managed:
mydata <- read.csv("file.csv", stringsAsFactors = FALSE)
mydata <- mydata[1:32, ]
Currently to select data for me to do 2 days at a time I have written:
mydata <- read.csv("file.csv", stringsAsFactors = FALSE)
mydata <- mydata[1:64, ]
Is there a way to automate adding 32 rows at each step (in a loop) rather than me running the code manually each time and changing the amount of data used manually each time?
I am very new to R so I do not know whether it is possible to do this, the way I thought would work was:
n <- 32
for (i in 1:100) {
mydata <- mydata[1:n, ]
## CREATE MCP AND STORE HOME RANGE OUTPUT
n <- n+32
}
However, this does not work with n representing a row number. Is there a way to do this?
Apologies if this is unclear but as I said I am quite new to using R and really would appreciate any help that can be given.
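A variable can be used as a row index; a likely problem in the loop above is that mydata is overwritten with its first 32 rows, so later iterations have nothing more to select. A minimal sketch that subsets into a new object and stores one result per step follows. The mock data and the stored value (just the row count) are placeholders for the real file and the MCP calculation.

```r
# Mock data standing in for file.csv: 5 days x 32 fixes per day.
set.seed(1)
mydata <- data.frame(x = rnorm(160), y = rnorm(160))

per_day <- 32
n_days <- nrow(mydata) %/% per_day

results <- vector("list", n_days)
for (i in seq_len(n_days)) {
  # Subset into a new object so the full data set is never overwritten.
  subset_i <- mydata[seq_len(i * per_day), ]
  # Placeholder for the MCP step: here only the number of rows is stored.
  results[[i]] <- nrow(subset_i)
}
unlist(results)   # 32 64 96 128 160
```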

calculate percentage in R

I am a beginner in R, and R is for me only a means to analyse my statistical data, so I am far from being a programmer. I need some help with building percentages of my variables from an Excel sheet. I need R.total as a percentage, with R.max as the 100% base. This is what I did:
DB <- read_excel("WechslerData.xlsx", sheet=1, col_names=TRUE,
col_types=NULL, na="", skip=0)
I wanted to use prop.table, but this does not work for me. Then I tried to make a data frame:
R.total <- DB$R.total
R.max <- DB$R.max
DB.rus <- data.frame(R.total, R.max)
but prop.table still does not work. Can somebody give me a hint?
Not really sure what you want, but with this mock data:
r.total <- runif(100,min=0, max=.6) # generate random variable
r.max <- runif(100,min=0.7, max=1) # generate random variable
df <- data.frame(r.total, r.max) # create mock data frame
You could try
# create a new column holding r.total as a percentage of r.max
df$percentage <- 100 * df$r.total / df$r.max
Hope it helps.
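As a small illustration of why prop.table probably did not give the expected result: it divides each entry by the sum of the whole object (or of a margin), not by another column. The numbers below are made up; only the column names follow the question.

```r
# Mock scores; the column names follow the question.
DB.rus <- data.frame(R.total = c(12, 30, 45), R.max = c(60, 60, 60))

# prop.table() divides by the sum of the vector (12 + 30 + 45 = 87),
# so it does not express R.total relative to R.max.
prop.table(DB.rus$R.total)

# Row-wise percentage with R.max as the 100% base.
DB.rus$R.percent <- 100 * DB.rus$R.total / DB.rus$R.max
DB.rus$R.percent   # 20 50 75
```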

Is there a way to parallelize summary functions running over loop?

For an input data frame
input<-data.frame(col1=seq(1,10000),col2=seq(1,10000),col3=seq(1,10000),col4=seq(1,10000))
I have to run the following summaries stored in another Data frame
summary<-data.frame(Summary_name=c('Col1_col2','Col3_Col4','Col2_Col3'),
ColIndex=c("1,2","3,4","2,3"))
#summary
Summary_name  ColIndex
Col1_col2     1,2
Col3_Col4     3,4
Col2_Col3     2,3
I have the following function to run the aggregates
library(stringr) # provides str_split()
loopSum<-function(input,summary){
  for(i in seq(1,nrow(summary))){
    summary$aggregate[i]<-sum(input[,as.numeric(unlist(str_split(summary$ColIndex[i],',')))])
  }
  return(summary)
}
My requirement is to run the sum used in loopSum in parallel, i.e. I would like to run all the summaries in one shot and thus reduce the total time the function takes to create the summaries. Is there a way to do this?
My actual scenario requires me to create summary statistics over hundreds of columns for each Summary_name in the summary data.frame, so I am looking for the most optimized way to do this. Any help is much appreciated.
Does this improve the running time?
library(tidyr)
input1 <- colSums(input)
summary1 <- separate(summary, "ColIndex", into=c("X1", "X2"), sep=",", convert = TRUE)
summary$aggregate <- input1[summary1$X1] + input1[summary1$X2]
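The same idea can be sketched in base R for any number of indices per row, which may matter when each summary covers hundreds of columns. Each column is summed once with colSums(), and the totals are then combined per summary row. The name summary_df is an assumption, chosen to avoid masking the base summary() function; the data are those from the question.

```r
input <- data.frame(col1 = 1:10000, col2 = 1:10000,
                    col3 = 1:10000, col4 = 1:10000)
summary_df <- data.frame(
  Summary_name = c("Col1_col2", "Col3_Col4", "Col2_Col3"),
  ColIndex = c("1,2", "3,4", "2,3"),
  stringsAsFactors = FALSE
)

# Sum every column once, then reuse those totals for each summary row.
col_totals <- colSums(input)

# Parse "1,2" style strings into integer index vectors of any length.
idx <- lapply(strsplit(summary_df$ColIndex, ","), as.integer)

summary_df$aggregate <- vapply(idx, function(i) sum(col_totals[i]), numeric(1))
summary_df
```

This only works for statistics that decompose over columns, such as sums. It avoids repeated passes over the data rather than running anything in parallel.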
