I would like to run multiple one-way ANOVAs over multiple columns of a data frame. My approach was a for loop. The first column of my data frame contains the groups. To provide a reproducible example, I use the iris data set here. I want to use rstatix::anova_test() instead of, for example, aov(), because rstatix::anova_test() is pipe-friendly, seems to be a better option for unbalanced data (which is what I have), and also lets me choose the type of sums of squares for the ANOVA.
When I write a for loop with aov(), it works. Unfortunately, I have so far failed to do the same with rstatix::anova_test(). Can anybody help me?
library(dplyr)   # for %>% and relocate()
library(rstatix) # for anova_test()
data <- iris %>% relocate(Species, .before = Sepal.Length)
# Define object which will receive the results
results <- NULL
results <- as.data.frame(results)
With aov() it works:
for(i in 2:ncol(data)){
# Put the name of the variable in first column of results object
results[i-1,1] <- names(data)[i]
# ANOVA test iterating through each column of the data frame and save output in a temporary object.
temp_anova_results <- broom::tidy(aov(data[,i] ~ Species, data = data))
# write ANOVA p value in second column of results object
results[i-1,2] <- temp_anova_results$p.value[1]
rm(temp_anova_results)
}
For several reasons I would like to work with rstatix::anova_test(), but I failed to write a working for loop. An example I tried:
for(i in 2:ncol(data)){
# Put the name of the variable in first column of results object
results[i-1,1] <- names(data)[i]
# ANOVA test iterating through each column of the data frame and save output in a temporary object.
temp_anova_results <- data %>% anova_test(data[,i] ~ Species, type = 3)
# write ANOVA p value in second column of results object
results[i-1,2] <- temp_anova_results$p[1]
rm(temp_anova_results)
}
data %>% anova_test(data[,i] ~ Species) seems to be the problem, but it works outside the for loop when a number is inserted for i, for example data %>% anova_test(data[,2] ~ Species).
Maybe somebody else has a better answer, but the only way I can get this to work is to build the formula from the column names, that is, replace your anova_test line with:
temp_anova_results <- data %>% anova_test(formula(paste0(names(data)[i],"~","Species")))
I don't know why your method didn't work. Even outside of the loop using i instead of a numeric constant broke the anova_test function call.
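The same idea can be written without a counter by looping over column names. This is only a sketch, assuming rstatix is installed; it builds each formula with base R's reformulate() instead of paste0(), and the names vars and results are made up for illustration:

```r
library(rstatix)

# All measurement columns, i.e. everything except the grouping column
vars <- setdiff(names(iris), "Species")

# One one-way ANOVA per column; the formula is built from the column name
results <- do.call(rbind, lapply(vars, function(v) {
  tab <- get_anova_table(
    anova_test(iris, reformulate("Species", response = v), type = 3)
  )
  data.frame(variable = v, p = tab$p[1])
}))

results
```

Because each formula contains a real column name (for example Sepal.Length ~ Species) rather than data[, i], the function receives the same kind of formula as in the answer above.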
I'm new to R and to coding in general...
I have run multiple ANOVAs on multiple columns (16 in total).
For that purpose, the purrr package helped me:
anova_results_5sector <- purrr::map(df_anova_ch[,3:18], ~aov(.x ~ df_anova_ch$Own_5sector))
summary(anova_results_5sector[[1]])
So the dumbest way to retrieve output (p-value, etc) is the following method
summary(anova_results_5sector$Env_Pillar)
summary(anova_results_5sector$Gov_Pillar)
summary(anova_results_5sector$Soc_Pillar)
summary(anova_results_5sector$CSR_Strat)
summary(anova_results_5sector$Comm)
summary(anova_results_5sector$ESG_Comb)
summary(anova_results_5sector$ESG_Contro)
summary(anova_results_5sector$ESG_Score)
summary(anova_results_5sector$Env_Innov)
summary(anova_results_5sector$Human_Ri)
summary(anova_results_5sector$Management)
summary(anova_results_5sector$Prod_Resp)
I've tried to use a loop :
for(i in 1:length(anova_results_5sector)){
summary(anova_results_5sector$[i])
}
It didn't work. I don't know, and could not find, how to deal with $ in a for loop.
Here is a look at the structure of the output:
[Image: structure of output]
I have tried several times with other methods, more or less complicated. Often the examples found online are too simple and do not let me adapt them to my data.
Any tips?
Thank you, and sorry for such a newbie question.
Whenever I use a loop for an analysis, I like to store the results in a data.frame, because it keeps a good overview. Since you did not provide a reproducible example, I used the iris dataset:
data("iris")
#make a data frame to store the results with as many columns and rows as you need
anova_results <- data.frame(matrix(ncol = 3, nrow = 3))
#one column per value you want to store and one row per anova you want to run
x <- c("number", "Mean_Sq", "p_value") #assign all values you want to store as column names
colnames(anova_results) <- x
anova_results$number <- 1:3 #assign a number to each anova you want to run, e.g. 3
In the loop you can now extract the ANOVA results that you are interested in. I use the mean squares and the p-value as an example, but you can of course add others. Don't forget to add a column for each additional value you want to store.
for (i in 2:4){
  my_anova <- aov(iris[[1]] ~ iris[[i]])
  p <- summary(my_anova)[[1]][["Pr(>F)"]][1] #extract the p value
  anova_results$p_value[anova_results$number == i-1] <- p
  mean_sq <- summary(my_anova)[[1]][["Mean Sq"]][1] #extract the mean squares (named mean_sq to avoid masking mean())
  anova_results$Mean_Sq[anova_results$number == i-1] <- mean_sq
}
View(anova_results)
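On the original question about $ in a loop: $ only accepts a literal name, so inside a loop the list has to be indexed with [[ ]] instead, and summary() has to be wrapped in print() because results are not auto-printed inside for. A minimal sketch on iris (the names responses, models and p_values are made up for illustration, since the asker's df_anova_ch is not available):

```r
# Fit one ANOVA per response column and keep them in a named list
responses <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
models <- lapply(responses, function(v) {
  aov(reformulate("Species", response = v), data = iris)
})
names(models) <- responses

# Index with [[ ]] (by name or position) and print explicitly
for (nm in names(models)) {
  cat("\n", nm, "\n")
  print(summary(models[[nm]]))
}

# Or skip the loop and collect the p-values directly
p_values <- sapply(models, function(m) summary(m)[[1]][["Pr(>F)"]][1])
p_values
```

The same indexing works on a list returned by purrr::map(), so anova_results_5sector[[i]] would replace the failing anova_results_5sector$[i].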
I have a dataset with 61 columns (60 explanatory variables and 1 response variable).
All the explanatory variables are numerical, and the response is categorical (Default). Some of the explanatory variables have negative values (financial data), and therefore it seems more sensible to standardize rather than normalize. However, when standardizing using the apply function, I have to remove the response variable first, so I do:
model <- read.table......
modelwithnoresponse <- model
modelwithnoresponse$Default <- NULL
means <- apply(modelwithnoresponse,2,mean)
standarddeviations <- apply(modelwithnoresponse,2,sd)
modelSTAN <- scale(modelwithnoresponse,center=means,scale=standarddeviations)
So far so good, the data is standardized. However, now I would like to add the response variable back to modelSTAN. I've seen some posts on dplyr, merge functions and rbind, but I couldn't quite get them to work so that the response would simply be added back as the last column of modelSTAN.
Does anyone have a good solution to this, or maybe another workaround to standardize it without removing the response variable first?
I'm quite new to R, as I'm a finance student and took R as an elective..
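If you want to add the column model$Default to modelSTAN, note that scale() returns a matrix, so convert it to a data frame first. Then you can do it like this:
# scale() returns a matrix; convert it before adding a non-numeric column
modelSTAN <- as.data.frame(modelSTAN)
# assign the column directly
modelSTAN$Default <- model$Default
# or use cbind for columns (rbind is for rows)
modelSTAN <- cbind(modelSTAN, Default = model$Default)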
However, you don't need to remove it at all. Here's an alternative:
modelSTAN <- model
## get index of response, here named Default
resp <- which(names(modelSTAN) == "Default")
## standardize all the non-response columns
means <- colMeans(modelSTAN[-resp])
sds <- apply(modelSTAN[-resp], 2, sd)
modelSTAN[-resp] <- scale(modelSTAN[-resp], center = means, scale = sds)
If you're interested in dplyr:
library(dplyr)
modelSTAN <- model %>%
  mutate(across(-all_of("Default"), ~ as.numeric(scale(.x))))
Note that in the dplyr version I didn't bother saving the original means and SDs. You should still do that if you want to back-transform later. By default, scale will use the mean and sd.
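To make the back-transform point concrete, here is a small base R sketch. The toy data frame and the names num_cols and restored are assumptions for illustration, not the asker's data:

```r
# Toy data standing in for the real financial data set
model <- data.frame(
  x1 = c(-2, 0, 4, 6),
  x2 = c(10, 20, 30, 60),
  Default = factor(c("no", "yes", "no", "yes"))
)

# Standardize every column except the response, keeping means and SDs
num_cols <- setdiff(names(model), "Default")
means <- colMeans(model[num_cols])
sds <- sapply(model[num_cols], sd)

modelSTAN <- model
modelSTAN[num_cols] <- scale(model[num_cols], center = means, scale = sds)

# Back-transform: multiply by the SDs, then add the means
restored <- sweep(sweep(as.matrix(modelSTAN[num_cols]), 2, sds, "*"), 2, means, "+")
```

The response column is never touched, so it stays in place as the last column.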
I am attempting to fit a bunch of different models to a single dataset. Each of the models uses a different combination of outcome variable and data subset. To fit all of these models, I created a dataframe with one column for the outcome variable and one column specifying the data subset (as a string). (Note that the subsets are overlapping so there doesn't appear to be an obvious way to do this using nest().) I then created a new function which takes one row of this dataframe and calls "lm" using these options. Lastly, I use pmap to map this function to the dataframe.
After a bunch of experimentation, I found an approach that works but is rather inelegant (see below for a simplified version of what I did). It seems like there should be a way to pass the subset condition to the subset argument of lm rather than using eval(parse(text = condition)) to first create a logical vector. I read the Advanced R section on metaprogramming in the hope that it would provide some insight, but I was unable to find anything that works.
Any suggestions would be helpful.
library(tidyverse)
outcomes <- c("mpg", "disp")
sub_conditions <- c("mtcars$cyl >=6", "mtcars$wt > 2")
models <- expand.grid(y = outcomes, condition = sub_conditions) %>% mutate_all(as.character)
fit <- function(y, condition) {
# Create the formula to use in all models
rx <- paste(y, "~ hp + am")
log_vec <- eval(parse(text = condition))
lm(rx, data = mtcars[log_vec,])
}
t <- pmap(models, fit)
Are you sure you want to pass the conditions as strings?
If so, there are not many options. You can use rlang::parse_expr as an alternative:
fit <- function(y, condition) {
rx <- paste(y, "~ hp + am")
lm(rx, data = mtcars[eval(rlang::parse_expr(condition)),])
}
and call it via
purrr::pmap(models, fit)
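If strings are not a hard requirement, one option is to store the conditions as unevaluated expressions and evaluate them inside the data, which also removes the need for the mtcars$ prefix. This is only a sketch in base R; the names conditions, grid, fit_one and fits are made up for illustration:

```r
# Conditions stored as quoted expressions rather than strings
conditions <- list(
  six_plus = quote(cyl >= 6),
  heavy    = quote(wt > 2)
)
outcomes <- c("mpg", "disp")

# One row per outcome/condition combination
grid <- expand.grid(y = outcomes, cond = names(conditions),
                    stringsAsFactors = FALSE)

fit_one <- function(y, cond) {
  # Evaluate the stored expression with mtcars columns in scope
  keep <- eval(conditions[[cond]], envir = mtcars)
  lm(reformulate(c("hp", "am"), response = y), data = mtcars[keep, ])
}

fits <- Map(fit_one, grid$y, grid$cond)
```

Map() plays the role of pmap() here; purrr::pmap(grid, fit_one) should work the same way because the column names of grid match the argument names of fit_one.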
I have a data set of plant demographics from 5 years across 10 sites with a total of 37 transects within the sites. Below is a link to a GoogleDoc with some of the data:
https://docs.google.com/spreadsheets/d/1VT-dDrTwG8wHBNx7eW4BtXH5wqesnIDwKTdK61xsD0U/edit?usp=sharing
In total, I have 101 unique combinations.
I need to subset each unique set of data, so that I can run each through some code. This code will give me one column of output that I need to add back to the original data frame so that I can run LMs on the entire data set. I had hoped to write a for-loop where I could subset each unique combination, run the code on each, and then append the output for each model back onto the original dataset. My attempts at writing a subset loop have all failed to produce even a simple output.
I created a column, "SiteTY", with unique Site, Transect, Year combinations. So "PWR 832015" is site PWR Transect 83 Year 2015. I tried to use that to loop through and fill an empty matrix, as proof of concept.
transect=unique(dat$SiteTY)
ntrans=length(transect)
tmpout=matrix(NA, nrow=ntrans, ncol=2)
for (i in 1:ntrans) {
df=subset(dat, SiteTY==i)
tmpout[i,]=(unique(df$SiteTY))
}
When I do this, I notice that df has no observations. If I replace "i" with a known value (like PWR 832015) and run each line of the for-loop individually, it populates correctly. If I use is.factor() for i or PWR 832015, both return FALSE.
This particular code also gives me the error:
Error in `[<-`(`*tmp*`, , i, value = mean(df$Year)) : subscript out of bounds
I can only assume this happens because the data frame is empty.
I've read enough SO posts to know that for-loops are tricky, but I've tried more iterations than I can remember to try to make this work in the last 3 years to no avail.
Any tips on loops or ways to avoid them while getting the output I need would be appreciated.
Since you need to subset each unique set of data, run a function on it, and use the output to calculate a new value, consider two routes:
Using ave if your function expects and returns a single numeric column.
Using by if your function expects a data frame and returns anything.
ave
Returns a grouped inline aggregate column with repeated value for every member of group. Below, with is used as context manager to avoid repeated dat$ references.
# BY SITE GROUPING
dat$New_Column <- with(dat, ave(Numeric_Column, Site, FUN=myfunction))
# BY SITE AND TRANSECT GROUPINGS
dat$New_Column <- with(dat, ave(Numeric_Column, Site, Transect, FUN=myfunction))
# BY SITE AND TRANSECT AND YEAR GROUPINGS
dat$New_Column <- with(dat, ave(Numeric_Column, Site, Transect, Year, FUN=myfunction))
by
Returns a named list of objects or whatever your function returns for each possible grouping. For more than one grouping, tryCatch is used due to possibly empty data frame item from all possible combinations where your myfunction can return an error.
# BY SITE GROUPING
obj_list <- by(dat, dat$Site, function(sub) {
myfunction(sub) # RUN ANY OPERATION ON sub DATA FRAME
})
# BY SITE AND TRANSECT GROUPINGS
obj_list <- by(dat, dat[c("Site", "Transect")], function(sub) {
tryCatch(myfunction(sub),
error = function(e) NULL)
})
# BY SITE AND TRANSECT AND YEAR GROUPINGS
obj_list <- by(dat, dat[c("Site", "Transect", "Year")], function(sub) {
tryCatch(myfunction(sub),
error = function(e) NULL)
})
# FILTERS OUT ALL NULLs (I.E., NO LENGTH)
obj_list <- Filter(length, obj_list)
# BUILDS SINGLE OUTPUT IF MATRIX OR DATA FRAME
final_obj <- do.call(rbind, obj_list)
Here's another approach using the dplyr library, in which I'm creating a data.frame of summary statistics for each group and then just joining it back on:
library(dplyr)
# Group by species (site, transect, etc) and summarise
species_summary <- iris %>%
group_by(Species) %>%
summarise(mean.Sepal.Length = mean(Sepal.Length),
mean.Sepal.Width = mean(Sepal.Width))
# A data.frame with one row per species, one column per statistic
species_summary
# Join the summary stats back onto the original data
iris_plus <- iris %>% left_join(species_summary, by = "Species")
head(iris_plus)
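As for why the loop in the question produced an empty df: SiteTY == i compares the labels with the counter 1, 2, 3, ... rather than with the i-th label, so nothing ever matches. A minimal sketch of the corrected loop, using made-up toy data and a hypothetical Height column in place of the real measurements:

```r
# Toy stand-in for the real data set
dat <- data.frame(
  SiteTY = c("PWR 832015", "PWR 832015", "PWR 842015", "ABC 12016"),
  Year   = c(2015, 2015, 2015, 2016),
  Height = c(1.2, 1.5, 0.9, 2.1)
)

transect <- unique(dat$SiteTY)

# One row of output per unique Site/Transect/Year combination
tmpout <- data.frame(SiteTY = transect,
                     n_obs = NA_integer_,
                     mean_height = NA_real_)

for (i in seq_along(transect)) {
  # Compare against the i-th label, not against the counter itself
  df <- subset(dat, SiteTY == transect[i])
  tmpout$n_obs[i] <- nrow(df)
  tmpout$mean_height[i] <- mean(df$Height)
}

# Attach the per-group output back onto the original rows
dat <- merge(dat, tmpout, by = "SiteTY")
```

Storing the output in a data frame keyed by SiteTY makes the final merge() straightforward, much like the left_join() above.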
This is something that data analysts do all the time (especially when working with survey data that features missing responses). It's common to first multiply impute a set of complete data matrices, fit models to each of these matrices, and then combine the results. At the moment I'm doing things by hand and am looking for a more elegant solution.
Imagine there are 5 *.csv files in the working directory, named dat1.csv, dat2.csv, ... dat5.csv. I want to estimate the same linear model using each data set.
Given this answer, a first step is to gather a list of the files, which I do with the following
csvdat <- list.files(pattern="dat.*csv")
Now I want to do something like
for(x in csvdat) {
lm.which(csvdat == "x") <- lm(y ~ x1 + x2, data = x)
}
The "which" statement is my silly way of trying to number each model in turn, using the position in the csvdat list that the loop is currently up to. That is, I'd like this loop to return a set of 5 lm objects with the names lm.1, lm.2, etc.
Is there some simple way to create these objects, and name them so that I can easily indicate which data set they correspond to?
Thanks for your help!
Another approach is to use the plyr package to do the looping. Using the example constructed by @chl, here is how you would do it:
require(plyr)
# read csv files into list of data frames
data_frames = llply(csvdat, read.csv)
# run regression models on each data frame
regressions = llply(data_frames, lm, formula = y ~ .)
names(regressions) = csvdat
Use a list to store the results of your regression models as well, e.g.
foo <- function(n) return(transform(X <- as.data.frame(replicate(2, rnorm(n))),
y = V1+V2+rnorm(n)))
write.csv(foo(10), file="dat1.csv", row.names=FALSE)
write.csv(foo(10), file="dat2.csv", row.names=FALSE)
csvdat <- list.files(pattern="dat.*csv")
lm.res <- list()
for (i in seq(along=csvdat))
lm.res[[i]] <- lm(y ~ ., data=read.csv(csvdat[i]))
names(lm.res) <- csvdat
What you want is a combination of the functions seq_along() and assign().
seq_along() creates a vector from 1 to 5 if there are five objects in csvdat (to get the appropriate numbers and not only the file names). Then assign() (using paste() to create the appropriate strings from the numbers) lets you create the variable.
Note that you will also need to load the data file first (was missing in your example):
for (x in seq_along(csvdat)) {
data.in <- read.csv(csvdat[x]) #be sure to change this to read.table if necessary
assign(paste("lm.", x, sep = ""), lm(y ~ x1 + x2, data = data.in))
}
seq_along is not totally necessary, there could be other ways to solve the numeration problem.
The critical function is assign. With assign you can create variables with a name based on a string. See ?assign for further info.
Following chl's comments (see his post) everything in one line:
for (x in seq_along(csvdat)) assign(paste("lm", x, sep = "."), lm(y ~ x1 + x2, data = read.csv(csvdat[x])))
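For comparison, here is a self-contained sketch of the list-based approach that writes its own example files to a temporary directory, so it can be run as-is. The simulated data and the names tmp_dir, models and coefs are assumptions for illustration:

```r
# Write three small example files to a temporary directory
tmp_dir <- tempfile("csvdemo")
dir.create(tmp_dir)
set.seed(1)
for (k in 1:3) {
  d <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
  d$y <- d$x1 + d$x2 + rnorm(20)
  write.csv(d, file.path(tmp_dir, paste0("dat", k, ".csv")), row.names = FALSE)
}

# Collect the files, fit the same model to each, and name by file
csvdat <- list.files(tmp_dir, pattern = "^dat.*\\.csv$", full.names = TRUE)
models <- lapply(csvdat, function(f) lm(y ~ x1 + x2, data = read.csv(f)))
names(models) <- basename(csvdat)

# A named list makes it easy to combine results afterwards
coefs <- t(sapply(models, coef))
coefs
```

Compared with assign(), keeping the models in one named list means later steps (such as pooling estimates across imputed data sets) can use lapply() or sapply() instead of looking up lm.1, lm.2, and so on by name.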