How to reverse the group comparison order for t.test()? - r

I would like to conduct a two-sample independent t-test and my grouping variable, "group" is factored as '0' and '1'. When calling the t.test(), it calculates the difference as mean for group 0 - mean for group 1, giving me a negative difference and negative confidence intervals.
How can I reverse the order of comparison so that R estimates mean for group 1 - mean for group 0? I know it compares alphabetically, so I assume it compares numbers in increasing order. I already tried fct_rev() from the forcats package to set group '1' as the reference group, but that did not change the order in which t.test() compared the groups and still gave negative outputs. This matters because it is a one-sided test, so I have expectations about the sign of the difference.
Thanks!
EDIT: I'd like to keep the names as '0' and '1' because they need to be coded as such for subsequent analyses to work

You can use the levels argument when constructing the factor. It is possible to revert the levels (see code below) and it is also possible to specify an arbitrary order:
# two samples with different mean
x <- c(rnorm(10, 5, 1), rnorm(10, 8, 1))
# the grouping variable
f1 <- factor(rep(0:1, each=10))
# the grouping variable with reverse order of levels
f2 <- factor(f1, levels=rev(levels(f1)))
t.test(x ~ f1)
t.test(x ~ f2)
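For instance, spelling out the desired order explicitly keeps the level names as '0' and '1' (as required in the edit) while flipping which group is subtracted:
# explicit level order: group 1 first, so t.test() reports mean(group 1) - mean(group 0)
f3 <- factor(f1, levels = c("1", "0"))
t.test(x ~ f3)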

For some reason I cannot comment, but can you not simply recode your grouping variable and rerun the test so the groups are compared in the order you want?
e.g.:
mydata <- read.csv("yourdataset.csv", na.strings = "")
# make a new "recoded group" variable
mydata$group.recoded <- mydata$group
mydata <- within(mydata, group.recoded[group == "0"] <- "group2")

Related

2-sample independent t-test where each of two columns is in different data frame

I need to run a 2-sample independent t-test, comparing Column1 to Column2. But Column1 is in DataframeA, and Column2 is in DataframeB. How should I do this?
Just in case relevant (feel free to ignore): I am a true beginner. My experience with R so far has been limited to running 2-sample matched t-tests within the same data frame by doing the following:
t.test(response ~ Column1,
       data = (Dataframe1 %>%
                 gather(key = "Column1", value = "response", "Column1", "Column2")),
       paired = TRUE)
TL;DR
t_test_result = t.test(DataframeA$Column1, DataframeB$Column2, paired=TRUE)
Explanation
If the data is paired, I assume that both dataframes will have the same number of observations (the same number of rows). You can check this with nrow(DataframeA) == nrow(DataframeB).
You can think of each column of a dataframe as a vector (an ordered list of values). The way that you have used t.test is by using a formula (y~x), and you were essentially saying: Given the dataframe specified in data, perform a t test to assess the significance in the difference in means of the variable response between the paired groups in Column1.
Another way of thinking about this is by grabbing the data in data and separating it into two vectors: the vector with observations for the first group of Column1, and the one for the second group. Then, for each vector, you compute the mean and standard deviation and apply the appropriate formula that gives you the t statistic and hence the p value.
Thus, you can just extract those 2 vectors separately and provide them as arguments to the t.test() function. I hope this was beginner-friendly enough ^^ otherwise let me know.
EDIT: a few additions
(I was going to reply in the comments but realized I did not have space hehe)
Regarding what @Ashish did in order to turn it into a Welch's test, I'd say it was to set var.equal = FALSE. The paired parameter controls whether the t-test is run on paired samples or not, and since your data frames have an unequal number of rows, I suspect the observations are not matched.
As for the Cohen's d effect size, you can check this stats exchange question, from which I copy the code:
For context, m1 and m2 are the group means (which you can get with m1 = mean(DataframeA$Column1)), s1 and s2 are the standard deviations (s2 = sd(DataframeB$Column2)), and n1 and n2 the sample sizes (n2 = length(DataframeB$Column2)).
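Spelled out with the same placeholder names (DataframeA, DataframeB and their columns are just the question's stand-ins):
m1 <- mean(DataframeA$Column1); s1 <- sd(DataframeA$Column1); n1 <- length(DataframeA$Column1)
m2 <- mean(DataframeB$Column2); s2 <- sd(DataframeB$Column2); n2 <- length(DataframeB$Column2)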
lx <- n1 - 1 # number of observations in group 1
ly <- n2 - 1 # number of observations in group 2
md <- abs(m1 - m2) ## mean difference (numerator)
csd <- lx * s1^2 + ly * s2^2
csd <- csd/(lx + ly)
csd <- sqrt(csd) ## common (pooled) sd computation
cd <- md/csd ## Cohen's d
This should work for you
res = t.test(DataFrameA$Column1, DataFrameB$Column2, alternative = "two.sided", var.equal = FALSE)

Replacing outliers in whole data set (based on Tukey and each level of a categorical variable) in R

How can I detect the outliers in a whole data set (all continuous columns) based on a categorical variable and replace them with NA? I want to use the Tukey technique, but applied within each level of a categorical variable. For example, replace the outliers of mtcars[, -c(8,9)] with NA based on each level of mtcars$am.
Or: how can I modify this code to work for all variables in each level of am?
lapply(mtcars, function(x){sort(outlier_values<- boxplot.stats(x)$out)})
EDIT: outliers are now 1.5*IQR, as specified in comment.
This replaces the outliers in the qsec column per group in the am column with NA's. It does so by first constructing a dataframe called limits, which contains the lower and upper bounds per am group. Then, that dataframe is joined with the original dataframe, and values outside the limits are set to NA.
library(dplyr)
limits = data.frame(am = unique(mtcars$am))
limits$lower = sapply(limits$am, function(x) quantile(mtcars$qsec[mtcars$am==x], 0.25) - 1.5 * (quantile(mtcars$qsec[mtcars$am==x], 0.75) - quantile(mtcars$qsec[mtcars$am==x], 0.25)))
limits$upper = sapply(limits$am, function(x) quantile(mtcars$qsec[mtcars$am==x], 0.75) + 1.5 * (quantile(mtcars$qsec[mtcars$am==x], 0.75) - quantile(mtcars$qsec[mtcars$am==x], 0.25)))
df = mtcars %>% left_join(limits, by = "am")
df$qsec = ifelse(df$qsec < df$lower | df$qsec > df$upper, NA, df$qsec)
df = df %>% select(-upper, -lower)
The 1.5 multiplier (and the quantiles used) can be adjusted to control how extreme a value has to be before it is treated as an outlier.
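For the broader goal in the question (all continuous columns, per level of am), here is a sketch of one way to do it with dplyr's across() (assuming dplyr >= 1.0; the helper tukey_na is my own, not from the answer above):
library(dplyr)
# apply the 1.5*IQR rule to one vector, returning it with outliers set to NA
tukey_na <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  replace(x, x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, NA)
}
mtcars_na <- mtcars %>%
  group_by(am) %>%
  mutate(across(-vs, tukey_na)) %>%  # every column except vs; am is excluded as the grouping variable
  ungroup()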

Calculate the average of a subsample with cast in R reshape2

I am attempting to calculate the average of a subsample with the function acast. As the subsample, I want to use data within a percentile range, for which I use quantile within subset. The problem seems to be that the quantiles are calculated before the data are arranged by groups, so the same cut-off values are used for every group. See the example below:
library(reshape2)
library(plyr)
data(airquality)
aqm <- melt(airquality, id=c("Month", "Day"), na.rm=TRUE)
## here I calculate the length for each group for the whole sample
acast(aqm, variable + Month ~ . , length, value.var = "value")
## here I calculate the length for the range within the quantiles 0.05 - 0.5
acast(aqm, variable + Month ~ . , length, value.var = "value", subset = .(value >= quantile(value,c(0.05)) & value <= quantile(value,c(0.5))))
I should get with the subset half of the observations for each group, but instead I get in some cases way less than half and in others way more. It seems to me that the quantiles are calculated on the whole melted data, and therefore the function applies the same quantile values to all groups.
Does anyone have an idea how to make the quantiles be calculated for each group? Any help would be appreciated. I know this would be possible by doing a loop over the categories, but I want to see if there is a way to do it all at once.
Thanks,
Sergio René
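One way to get group-specific quantiles (a sketch, not an answer from this thread; it sidesteps acast's subset argument and uses plyr, which is already loaded) is to compute the cut-offs inside each variable/Month group:
library(plyr)
# per variable and Month: how many observations fall between that group's own
# 5th and 50th percentiles, and their mean
ddply(aqm, .(variable, Month), summarise,
      n_within    = sum(value >= quantile(value, 0.05) & value <= quantile(value, 0.5)),
      mean_within = mean(value[value >= quantile(value, 0.05) & value <= quantile(value, 0.5)]))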

Linear regression on subsets with dependent variable per column using dlply() in R

I would like to automatically produce linear regressions for a data frame for each category separately.
My data frame includes one column with time categories, one column (slope$Abs) as the dependent variable, several columns, which should be used as the independent variable.
head(slope)
timepoint Abs In1 In2 In3 Out1 Out2 Out3 ...
1: t0 275.0 2.169214 2.169214 2.169214 2.069684 2.069684 2.069684
2: t0 275.5 2.163937 2.163937 2.163937 2.063853 2.063853 2.063853
3: t0 276.0 2.153298 2.158632 2.153298 2.052088 2.052088 2.057988
4: ...
All in all for each timepoint I have 40 variables, and I want to end up with a linear regression for each combination. Such as In1~Abs[t0], In1~Abs[t1] and so on for each column.
Of course I can do this manually, but I guess there must be a more elegant way to do the work.
I did my research and found out that dlply() might be the function I'm looking for. However, my attempt results in an error.
So I somehow tried to combine the answers from previous questions I have found:
On individual variables per column and on subsets per category
I came up with a function like this:
lm.fun <- function(x) {summary(lm(x ~ slope$Abs, data=slope))}
lm.list <- dlply(.data=slope, .variables=slope$timepoint, .fun=lm.fun )
But I get the following error:
Error in eval.quoted(.variables, data) :
envir must be either NULL, a list, or an environment.
Hope someone can help me out.
Thanks a lot in advance!
Based on my research, the dplyr package in R does not accept formulas of the form y ~ x into its functions very well. So the alternative is to calculate it somewhat manually. First note that slope = cor(x,y)*sd(y)/sd(x) (reference here: http://faculty.cas.usf.edu/mbrannick/regression/regbas.html) and that intercept = mean(y) - slope*mean(x). Simple linear regression requires that we use the centroid as our point of reference when finding the intercept, because it is an unbiased estimator; using a single point will only get you the intercept through that individual point and not the overall intercept.
Now for this explanation, I will be using the mtcars data set. I only wanted a subset of the data so I am using variables c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec') to basically mimic your dataset. In my example, my grouping variable is 'cyl', which is the equivalent of your 'timepoint' variable. The variable 'mpg' is the y-variable in this case, which is equivalent to 'Abs' in your data.
Based on my explanation of slope and intercept above, it is clear that we need three tables/datasets: a correlation dataset for your y with respect to your x for each group, a standard deviation table for each variable and group, and a table of means for each group and each variable.
To get the correlation dataset, we want to group by 'cyl' and calculate the correlation coefficients within each group. For that, you should use:
library(dplyr)
df <- mtcars[c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec')]
corrs <- data.frame(df %>% group_by(cyl) %>% do(head(data.frame(cor(.[,c(1,3:7)])), n = 1)))
Because of the way my dataset is structured, the second variable (df[ ,2]) is 'cyl'. For you, you should use
do(head(data.frame(cor(.[,c(2:40)])), n = 1)))
since your first column is the grouping variable and it is not numeric. Essentially, you want to go across all numeric variables. Not using head will produce a full correlation matrix, but since you are interested in the slope of y on each x-variable independently of the others, you only need the row for your y-variable, i.e. the one whose correlation coefficient with y equals 1 (r_yy = 1).
To get the standard deviations and means for each group and each variable, use
sds <- data.frame(df %>% group_by(cyl) %>% summarise_each(funs(sd)))
means <- data.frame(df %>% group_by(cyl) %>% summarise_each(funs(mean)))
Your group names will be the first column, so make sure to rename your rows for each dataset corrs, sds, and means and delete column 1.
rownames(corrs) <- rownames(means) <- rownames(sds) <- corrs[ ,1]
corrs <- corrs[ ,-1]; sds <- sds[ ,-1]; means <- means[ ,-1]
Now we need to calculate sd(y)/sd(x). The best way I have done this, and seen it done, is with an apply-family function.
sdst <- data.frame(t(apply(sds, 1, function(X) X[1]/X)))
I use X[1] because the first variable in sds is my y-variable. The first variable after you have deleted timepoint is Abs, which is your y-variable, so use that.
Now the rest is pretty straightforward. Since everything is saved as a data frame, to find the slopes all you need to do is
slopes <- sdst*corrs
inter <- slopes*means
intercept <- data.frame(t(apply(inter, 1, function(x) x[1]-x)))
Again here, since our y-variable is in the first column, we use x[1]. To check if all is well, your slopes for your y-variable should be 1 and the intercept should be 0.
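As an illustrative check (my own sketch, not part of the answer above), the manual slope and intercept for one x-variable and one group should match what lm() gives on that subset of mtcars:
# e.g. mpg ~ wt within the cyl == 4 group; the row names were set to the cyl levels above
manual <- c(slopes["4", "wt"], intercept["4", "wt"])
fitted <- coef(lm(mpg ~ wt, data = subset(mtcars, cyl == 4)))
# manual[1] should equal fitted["wt"] and manual[2] should equal fitted["(Intercept)"],
# up to floating-point error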
I have solved the issue with a simpler approach, so I wanted to update the answer.
To make life easier I reshaped the data frame so that all the measurement columns are stacked into rows, using the melt() function of the reshape package, and saved the result back into slope:
library(reshape)
slope <- melt(slope, id = c("Abs", "timepoint"), variable_name = "Sites")
The output's column name is by default "value".
Then create one column that combines both grouping variables with paste():
slope$FullTreat <- paste(slope$Sites,slope$timepoint, sep="_")
Run a function through the dataset to create separate models for each treatment combination.
models <- dlply(slope, ~ FullTreat, function(df) {
lm(value ~ Abs, data = df)
})
To extract the coefficients simply run
coefs <- ldply(models, coef)
Then split the FullTreat column back into separate columns with colsplit(), also from reshape, and bind the intercepts and slopes to the new data frame:
coefs <- cbind(colsplit(coefs$FullTreat, split="_",
c("Sites","Timepoint")), coefs[,2:3])
I haven't worked on a function that plots all the regressions from the models, but I guess this is feasible with the ldply() function.
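One option for the plotting (just a sketch; it uses ggplot2 rather than ldply, which is an assumption about what is wanted) is to let geom_smooth() refit the same per-group regressions and facet by treatment:
library(ggplot2)
ggplot(slope, aes(x = Abs, y = value)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # one regression line per facet
  facet_wrap(~ FullTreat)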

log- and z-transforming my data in R

I'm preparing my data for a PCA, for which I need to standardize it. I've been following someone else's code in vegan but am not getting a mean of zero and SD of 1, as I should be.
I'm using a data set called musci which has 13 variables, three of which are labels to identify my data.
log.musci<-log(musci[,4:13],10)
stand.musci<-decostand(log.musci,method="standardize",MARGIN=2)
When I then check for mean=0 and SD=1...
colMeans(stand.musci)
sapply(stand.musci,sd)
I get mean values ranging from -8.9 to 3.8 and SD values are just listed as NA (for every data point in my data set rather than for each variable). If I leave out the last variable in my standardization, i.e.
log.musci<-log(musci[,4:12],10)
the means don't change, but the SDs now all have a value of 1.
Any ideas of where I've gone wrong?
Cheers!
Your data is likely a matrix.
## Sample data
dat <- as.matrix(data.frame(a=rnorm(100, 10, 4), b=rexp(100, 0.4)))
So, either convert to a data.frame and use sapply to operate on columns
dat <- data.frame(dat)
scaled <- sapply(dat, scale)
colMeans(scaled)
# a b
# -2.307095e-16 2.164935e-17
apply(scaled, 2, sd)
# a b
# 1 1
or use apply to do columnwise operations
scaled <- apply(dat, 2, scale)
A z-transformation is quite easy to do manually.
See below using a small vector of example data.
data <- c(1,2,3,4,5,6,7,8,9,10)
data
mean(data)
sd(data)
z <- ((data - mean(data))/(sd(data)))
z
mean(z) == 0
sd(z) == 1
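For comparison (my own note, not part of the original answer), base R's scale() does the same centering and scaling in one call:
z2 <- as.vector(scale(data))  # scale() centers by the mean and divides by the sd
all.equal(z, z2)              # TRUE, up to floating-point error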
The logarithm transformation is done using the log() function. Note that log() defaults to the natural logarithm; your original call log(musci[,4:13], 10) used base 10, which you can also get with log10().
log(data)
Hope this helps!
