For Loop t.test, Comparing Means by Factor Class in R

I want to loop a lot of one-sided t.tests, comparing mean crop harvest value by pattern for a set of different crops.
My data is structured like this:
df <- data.frame("crop" = rep(c('Beans', 'Corn', 'Potatoes'), 10),
                 "value" = rnorm(n = 30),
                 "pattern" = rep(c("mono", "inter"), 15),
                 stringsAsFactors = TRUE)
I would like the output to provide results from a t.test, comparing mean harvest of each crop by pattern (i.e. compare harvest of mono-cropped potatoes to intercropped potatoes), where the alternative is greater value for the intercropped pattern.
Help!

Here's an example using base R.
# Generate example data
df <- data.frame("crop" = rep(c('Beans', 'Corn', 'Potatoes'), 10),
                 "value" = rnorm(n = 30),
                 "pattern" = rep(c("inter", "mono"), 15),
                 stringsAsFactors = TRUE)
# Create a list which will hold the output of the test for each crop
crops <- unique(df$crop)
test_output <- vector('list', length = length(crops))
names(test_output) <- crops
# For each crop, save the output of a one-sided t-test
for (crop in crops) {
  # Filter the data to include only observations for the particular crop
  crop_data <- df[df$crop == crop, ]
  # Save the results of a t-test with a one-sided alternative
  test_output[[crop]] <- t.test(formula = value ~ pattern,
                                data = crop_data,
                                alternative = 'greater')
}
It's important to note that when you call t.test() with the formula interface (e.g. y ~ x) and the independent variable is a factor, alternative = 'greater' tests whether the mean in the first factor level ("inter" in this data, since levels are sorted alphabetically) is greater than the mean in the second level (here, "mono").
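If your real data happens to have the factor levels in the other order, one way to control the direction of the one-sided test is to set the level order explicitly before testing; a minimal sketch:

# Make "inter" the first level so alternative = 'greater'
# tests mean(inter) > mean(mono)
df$pattern <- factor(df$pattern, levels = c("inter", "mono"))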

Here's the "tidyverse" approach, which makes use of the tidy function from the broom package to store the output of a t-test as a data frame. Instead of a formal for loop, the group_by and do functions from dplyr accomplish the same thing.
library(dplyr)
library(broom)
# Generate example data
df <- data.frame("crop" = rep(c('Beans', 'Corn', 'Potatoes'), 10),
                 "value" = rnorm(n = 30),
                 "pattern" = rep(c("inter", "mono"), 15),
                 stringsAsFactors = TRUE)
# Group the data by crop, and run a t-test for each subset of data.
# Use the tidy function from the broom package
# to capture the t.test output as a data frame
df %>%
  group_by(crop) %>%
  do(tidy(t.test(formula = value ~ pattern,
                 data = .,
                 alternative = 'greater')))
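Note that do() has been superseded in more recent versions of dplyr; a sketch of the same pipeline using group_modify instead (assuming dplyr >= 1.0):

df %>%
  group_by(crop) %>%
  group_modify(~ tidy(t.test(formula = value ~ pattern,
                             data = .x,
                             alternative = 'greater')))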

Consider by, the object-oriented wrapper to tapply, designed to subset a data frame by factor(s) and run operations on each subset:
t_test_list <- by(df, df$crop, function(sub)
  t.test(formula = value ~ pattern,
         data = sub, alternative = 'greater')
)
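The result is a list with one t-test object per crop, so individual pieces are easy to pull out afterwards; for example:

# Extract the p-value from each crop's test
sapply(t_test_list, function(tt) tt$p.value)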

Related

Combining for loops and ifelse in R

I am trying to write a for loop that will generate a correlation for a fixed column (LPS0) vs. all other columns in the data set. I don't want to use a correlation matrix because I only care about the correlation of LPS0 with each other column, not the correlations of the other columns with themselves. I then want to include an if statement to print only the significant correlations (p.value <= 0.05). I ran into some issues where some of the p.values are returned as NA, so I switched to if_else. However, I am now getting an error. My code is as follows:
for (i in 3:ncol(microbiota_lps_0_morm)) {
  morm_0 <- cor.test(microbiota_lps_0_morm$LPS0,
                     microbiota_lps_0_morm[[colnames(microbiota_lps_0_morm)[i]]],
                     method = "spearman")
  if_else(morm_0$p.value <= 0.05, print(morm_0), print("Not Sig"), print("NA"))
}
The first value is returned, and then the loop stops with the following error:
Error in if_else():
! true must be length 1 (length of condition), not 8.
Backtrace:
1. dplyr::if_else(morm_0$p.value <= 0.05, print(morm_0), print("Not Sig"), print("NA"))
How can I make the loop print morm_0 only when p.value <= 0.05?
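For what it's worth, the error happens because if_else() is vectorized: it expects true and false to be the same length as the condition, and print(morm_0) returns the whole test object (hence the length-8 complaint). A minimal fix for the loop as written, using plain if with a guard for NA p-values:

for (i in 3:ncol(microbiota_lps_0_morm)) {
  morm_0 <- cor.test(microbiota_lps_0_morm$LPS0,
                     microbiota_lps_0_morm[[i]],
                     method = "spearman")
  # Plain if handles a single logical; check for NA first
  if (!is.na(morm_0$p.value) && morm_0$p.value <= 0.05) {
    print(morm_0)
  } else {
    print("Not Sig")
  }
}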
Here's a long piece of code which automates the whole thing. It might be overkill, but you can just take the matrix and use whatever you need. It makes use of the tidyverse.
library(tidyverse)

df <- select_if(mtcars, is.numeric)
glimpse(df)
# keeping real names
dict <- cbind(original = names(df), new = paste0("v", 1:ncol(df)))
# but changing names for better data viz
colnames(df) <- paste0("v", 1:ncol(df))
# correlating between variables + p values
pvals <- list()
corss <- list()
for (coln in colnames(df)) {
  pvals[[coln]] <- map(df, ~ cor.test(df[, coln], .)$p.value)
  corss[[coln]] <- map(df, ~ cor(df[, coln], .))
}
# Keeping both matrices in a list
matrices <- list(
  pvalues = matrix(data = unlist(pvals),
                   ncol = length(names(pvals)),
                   nrow = length(names(pvals))),
  correlations = matrix(data = unlist(corss),
                        ncol = length(names(corss)),
                        nrow = length(names(corss)))
)
rownames(matrices[[1]]) <- colnames(df)
rownames(matrices[[2]]) <- colnames(df)
# Creating a combined data frame
long_cors <- expand.grid(Var1 = names(df), Var2 = names(df)) %>%
  mutate(cor = unlist(matrices["correlations"]),
         pval = unlist(matrices["pvalues"]),
         same = Var1 == Var2,
         significant = pval < 0.05,
         dpcate = duplicated(cor)) %>%
  # Drop self-correlations, non-significant results, and duplicates
  filter(same == FALSE, significant == TRUE, dpcate == FALSE) %>%
  select(-c(same, dpcate, significant))
# Plotting correlations
long_cors %>%
  mutate(negative = cor < 0) %>%
  ggplot(aes(x = Var1, y = Var2,
             color = negative, size = abs(cor), fill = Var2,
             label = round(cor, 2))) +
  geom_label(show.legend = FALSE, alpha = 0.2) +
  scale_color_manual(values = c("black", "darkred")) +
  # Sizing each correlation by its magnitude
  scale_size_area(seq(1, 100, length = length(unique(long_cors$Var1)))) +
  theme_light() +
  theme(axis.text = element_text(face = "bold", size = 12)) +
  labs(title = "Correlation between variables",
       caption = "p < 0.05") +
  xlab("") + ylab("")
If you want to correlate a column of a matrix with the remaining columns, you can do so with one function call:
mtx <- matrix(rnorm(800), ncol=8)
cor(mtx[,1], mtx[,-1])
However, you will not get p-values. For getting p-values, I would recommend this approach:
library(tidyverse)
significant <- map_dbl(2:ncol(mtx),
                       ~ cor.test(mtx[, 1], mtx[, .], method = "s")$p.value)
Whenever you feel like you need a for loop in R, chances are you should be using another approach. for is a very un-R construct, and R gives you many better ways of handling the same issues. The map_* family of functions from purrr (part of the tidyverse) is but one of them. Another approach, in base R, is apply:
significant <- apply(mtx[, -1], 2,
                     \(x) cor.test(x, mtx[, 1], method = "s")$p.value)

Using boot::boot() function with grouped variables in R

This is a question both about using the boot() function with grouped variables and about passing multiple columns of data into boot. Almost all examples of the boot() function seem to pass a single column of data to calculate a simple bootstrap of the mean.
My specific analysis is trying to use the stats::weighted.mean(x,w) function which takes a vector 'x' of values to calculate the mean and a second vector 'w' for weights. The main point is that I need two inputs into this function - and I'm hoping the solution will generalize to any function that takes multiple arguments.
I'm also looking for a solution to use this weighted.means function in a dplyr style workflow with group_by() variables. If the answer is that "it can't be done with dplyr", that's fine, I'm just trying to figure it out.
Below I simulate a dataset with three groups (A,B,C) that each have different ranges of counts. I also attempt to come up with a function "my.function" that will be used to bootstrap the weighted average. Here might be my first mistake: is this how I would set up a function to pass in the 'count' and 'weight' columns of data into each bootstrapped sample? Is there some other way to index the data?
Inside the summarise() call, I reference the original data with "." - Possibly another mistake?
The end result shows that I was able to achieve appropriately grouped calculations using mean() and weighted.mean(), but the calls for confidence intervals using boot() have instead calculated the 95% confidence interval around the global mean of the dataset.
Suggestions on what I'm doing wrong? Why is the boot() function referencing the entire dataset and not the grouped subsets?
library(tidyverse)
library(boot)
set.seed(20)
sample.data = data.frame(letter = rep(c('A','B','C'), each = 50) %>% as.factor(),
                         counts = c(runif(50, 10, 30), runif(50, 40, 60), runif(50, 60, 100)),
                         weights = sample(10, 150, replace = TRUE))
##Define function to bootstrap
##I'm using stats::weighted.mean() which needs to take in two arguments
##############
my.function = function(data, index){
  d = data[index, ] # create bootstrap sample of all columns of original data?
  return(weighted.mean(d$counts, d$weights)) # calculate weighted mean using 'counts' and 'weights' columns
}
##############
## group by 'letter' and calculate weighted mean, and upper/lower 95% CI limits
## I pass data to boot using "." thinking that this would only pass each grouped subset of data
##(e.g., only letter "A") to boot, but instead it seems to pass the entire dataset.
sample.data %>%
  group_by(letter) %>%
  summarise(avg = mean(counts),
            wtd.avg = weighted.mean(counts, weights),
            CI.LL = boot.ci(boot(., my.function, R = 100), type = "basic")$basic[4],
            CI.UL = boot.ci(boot(., my.function, R = 100), type = "basic")$basic[5])
And below I've calculated a rough estimate of 95% confidence intervals around the global mean to show that this is what was going on with boot() in my summarise() call above
#Here is a rough 95% confidence interval estimate as +/- 1.96* Standard Error
mean(sample.data$counts) + c(-1,1) * 1.96 * sd(sample.data$counts)/sqrt(length(sample.data[,1]))
The following base R solution solves the problem of bootstrapping by groups. The reason your summarise() call computed intervals around the global mean is that, inside a magrittr pipeline, "." refers to the entire data frame entering the pipe, not the current group, so boot() always received the full dataset. Note also that boot::boot is called only once per group here, rather than once per confidence limit.
library(boot)
sp <- split(sample.data, sample.data$letter)
y <- lapply(sp, function(x){
  wtd.avg <- weighted.mean(x$counts, x$weights)
  basic <- boot.ci(boot(x, my.function, R = 100), type = "basic")$basic
  CI.LL <- basic[4]
  CI.UL <- basic[5]
  data.frame(wtd.avg, CI.LL, CI.UL)
})
do.call(rbind, y)
# wtd.avg CI.LL CI.UL
#A 19.49044 17.77139 21.16161
#B 50.49048 48.79029 52.55376
#C 82.36993 78.80352 87.51872
Final clean-up:
rm(sp)
A dplyr solution could be the following. It also calls map_dfr from package purrr.
library(boot)
library(dplyr)
sample.data %>%
  group_split(letter) %>%
  purrr::map_dfr(
    function(x){
      wtd.avg <- weighted.mean(x$counts, x$weights)
      basic <- boot.ci(boot(x, my.function, R = 100), type = "basic")$basic
      CI.LL <- basic[4]
      CI.UL <- basic[5]
      data.frame(wtd.avg, CI.LL, CI.UL)
    }
  )
# wtd.avg CI.LL CI.UL
#1 19.49044 17.77139 21.16161
#2 50.49048 48.79029 52.55376
#3 82.36993 78.80352 87.51872
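One caveat: group_split drops the letter labels from the output above. A small sketch of one way to carry them along, taking the label from the first row of each subset:

sample.data %>%
  group_split(letter) %>%
  purrr::map_dfr(function(x){
    basic <- boot.ci(boot(x, my.function, R = 100), type = "basic")$basic
    data.frame(letter = x$letter[1],
               wtd.avg = weighted.mean(x$counts, x$weights),
               CI.LL = basic[4], CI.UL = basic[5])
  })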

Simulate rnorm in R for many observations using mean and sd from each row

I am attempting to apply the rnorm function to many rows (214) of a data frame in R.
I want to use the predefined mean and sd values from each row of the data frame to run the simulations, with n = 10,000 draws for every row.
I would like to use the apply function to do this, however, I am unclear how to write the rnorm call within the apply function to accomplish this for all rows at once.
Reproducible example:
set.seed(1)
Data <- data.frame(
  Hazard = LETTERS[1:10],
  mean = sample(1:10),
  sd = c(0.14, 0.23, 0.21, 0.27, 0.12, 0.19, 0.21, 0.18,
         0.29, 0.22)
)
Code I tried:
dist <- rnorm(10000, mean=Data$mean, sd=Data$sd)
apply(X= Data,
FUN = dist,
MARGIN = 1)
Thanks in advance for your assistance.
It may be better to use Map here, where we loop over the corresponding elements of the 'mean' and 'sd' columns, apply rnorm, and return a list:
n <- 10000
lst1 <- Map(function(x, y) rnorm(n, mean = x, sd = y), Data$mean, Data$sd)
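The list elements are unnamed by default; naming them by hazard makes the output easier to navigate (a small addition):

# Label each set of simulations with its Hazard
names(lst1) <- Data$Hazard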
Or, if we prefer apply, subset the numeric columns of interest and loop over the rows; this returns a 10000 x 10 matrix with one column per row of Data:
apply(Data[-1], 1, FUN = function(x) rnorm(n, mean = x[1], sd = x[2]))
My solution would be to use expand.grid to generate a new data frame with one row per simulation for each possible mean/sd combination.
library(dplyr)
sim_data <-
  expand.grid(Hazard = Data$Hazard, iteration = 1:10000) %>%
  left_join(Data, by = "Hazard") %>%
  mutate(x = rnorm(n(), mean = mean, sd = sd)) # rnorm needs n first; mean/sd are recycled row-wise

Caret - creating stratified data sets based on several variables

In the R package caret, can we create stratified training and test sets based on several variables using the function createDataPartition() (or createFolds() for cross-validation)?
Here is an example for one variable:
#2/3rds for training
library(caret)
inTrain = createDataPartition(df$yourFactor, p = 2/3, list = FALSE)
dfTrain = df[inTrain, ]
dfTest = df[-inTrain, ]
In the code above the training and test sets are stratified by 'df$yourFactor'. But is it possible to stratify using several variables (e.g. 'df$yourFactor' and 'df$yourFactor2')? The following code seems to work but I don't know if it is correct:
inTrain = createDataPartition(df$yourFactor, df$yourFactor2, p = 2/3, list = FALSE)
This is fairly simple if you use the tidyverse.
For example:
library(dplyr)

df <- df %>%
  mutate(n = row_number()) %>% # create row number if you don't have one
  select(n, everything()) # put 'n' at the front of the dataset
train <- df %>%
  group_by(var1, var2) %>% # any number of variables you wish to partition by proportionally
  sample_frac(.7) # '.7' is the proportion of the original df you wish to sample
test <- anti_join(df, train) # creates test data frame with those observations not in 'train'
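Since n is a unique row id, the anti_join can be made explicit instead of joining on every common column; a small tweak:

test <- anti_join(df, train, by = "n") # match on the row id only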
There is a better way to do this.
set.seed(1)
n <- 1e4
d <- data.frame(yourFactor = sample(1:5, n, TRUE),
                yourFactor2 = rbinom(n, 1, .5),
                yourFactor3 = rbinom(n, 1, .7))
# stratum indicator
d$group <- interaction(d[, c('yourFactor', 'yourFactor2')])
# sample selection
indices <- tapply(1:nrow(d), d$group, sample, 30)
# obtain subsample
subsampd <- d[unlist(indices, use.names = FALSE), ]
What this does is draw a random stratified sample of size 30 from every combination of yourFactor and yourFactor2.
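If you want a proportional split per stratum rather than a fixed size, the same tapply idea works; a sketch, sampling 2/3 of each stratum as in the original question:

# Sample 2/3 of the row indices within each stratum
indices <- tapply(1:nrow(d), d$group,
                  function(i) sample(i, size = round(2/3 * length(i))))
dTrain <- d[unlist(indices, use.names = FALSE), ]
dTest <- d[-unlist(indices, use.names = FALSE), ]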

R: iterate a function over two lists simultaneously using lapply?

I have multiple factors dividing my data.
By one factor (uniqueGroup) I would like to subset my data; by another factor (distance) I want to first classify my data by a "moving threshold" and then test the statistical difference between groups.
I have created a function movThreshold to classify my data and test it with wilcox.test. To vary the different threshold values, I just run
lapply(th.list, # list of thresholds
       movThreshold, # my function
       tab = tab, # original data
       dependent = "infGrad") # dependent variable
Now I've realized that I actually need to first subset my data by uniqueGroup and then vary the threshold value, but I am not sure how to write this in my lapply code.
My dummy data:
set.seed(10)
infGrad <- c(rnorm(20, mean = 14, sd = 8),
             rnorm(20, mean = 13, sd = 5),
             rnorm(20, mean = 8, sd = 2),
             rnorm(20, mean = 7, sd = 1))
distance <- rep(c(1:4), each = 20)
uniqueGroup <- rep(c("x", "y"), 40)
tab <- data.frame(infGrad, distance, uniqueGroup)
# Create moving threshold function &
# test for original data
# ============================================
movThreshold <- function(th, tab, dependent, ...) {
  # Classify data
  tab$group <- ifelse(tab$distance < th, "a", "b")
  # Calculate Wilcoxon test - as I have only two groups
  test <- wilcox.test(tab[[dependent]] ~ as.factor(group), # specify column name
                      data = tab)
  # Put results in a vector
  c(th, unique(tab$uniqueGroup), dependent, uniqueGroup, round(test$p.value, 3))
}
# Define two vectors to run through
# unique group
gr.list<-unique(tab$uniqueGroup)
# unique threshold
th.list<-c(2,3,4)
How can I run lapply over two lists?
lapply(c(th.list, gr.list), # iterate over two vectors - does NOT work!!
       movThreshold,
       tab = tab,
       dependent = "infGrad")
In my previous question (Kruskal-Wallis test: create lapply function to subset data.frame?), I've learnt how to iterate through individual subsets within a table:
lapply(split(tab, tab$uniqueGroup), movThreshold)
But how do I iterate through subsets and through thresholds at once?
If I understood correctly what you're trying to do, here is a data.table solution:
library(data.table)
setDT(tab)[, lapply(th.list, movThreshold, tab = tab, dependent = "infGrad"), by = uniqueGroup]
Also, you can just do a nested lapply.
lapply(gr.list, function(z)
  lapply(th.list, movThreshold, tab = tab[tab$uniqueGroup == z, ], dependent = "infGrad"))
I apologize if I misunderstood what you're trying to do.
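For completeness, one base R option for iterating over both vectors at once (a sketch, assuming tab is a plain data frame) is to build every threshold/group combination with expand.grid and loop over the pairs with Map:

# All combinations of threshold and group
combos <- expand.grid(th = th.list, gr = gr.list, stringsAsFactors = FALSE)
# Run movThreshold on each group subset for each threshold
results <- Map(function(th, gr)
  movThreshold(th, tab = tab[tab$uniqueGroup == gr, ], dependent = "infGrad"),
  combos$th, combos$gr)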
