Paired sample t-test in R: a question of direction - r

I have a question about the sign of t in a paired-sample t-test using different data structures, but the same data. I know that the sign doesn't make a difference in terms of significance, but, it does generally tell the user if there have been decreases over time or increases over time. So, I need to make sure that the code I provide produces the same results OR, is explained correctly.
I have to explain the results (and code) as an example we're giving users of our software, which uses R (Rdotnet within a C# program) for statistics. I'm unclear as to the proper order of variables in both methods in R.
Method 1 uses two matrices
## Sets seed for repetitive number generation
set.seed(2820)
## Creates the matrices
preTest <- c(rnorm(100, mean = 145, sd = 9))
postTest <- c(rnorm(100, mean = 138, sd = 8))
## Runs paired-sample T-Test just on two original matrices
t.test(preTest,postTest, paired = TRUE)
The results show significance and with the positive t, tells me that there has been a reduction in the mean difference from preTest to PostTest.
Paired t-test
data: preTest and postTest
t = 7.1776, df = 99, p-value = 1.322e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
6.340533 11.185513
sample estimates:
mean of the differences
8.763023
However, most people are going to get their data not from two matrices, but, from a file with values for BEFORE and AFTER. I will have these data in a csv and import them during a demo. So, to mimic this, I need to create data frame in the structure that users of our software are used to seeing. 'pstt' should look like the dataframe I have after I import a csv.
Method 2: uses a flat-file structure
## Converts the matrices into a dataframe that looks like the way these
data are normally stored in a csv or Excel
ID <- c(1:100)
pstt <- data.frame(ID,preTest,postTest)
## Puts the data in a form that can be used by R (grouping var | data var)
pstt2 <- data.frame(
group = rep(c("preTest","postTest"),each = 100),
weight = c(preTest, postTest)
)
## Runs paired-sample T-Test on the newly structured data frame
t.test(weight ~ group, data = pstt2, paired = TRUE)
The results for this t-test has the t negative, which may indicate to the user that the variable under study has increased over time.
Paired t-test
data: weight by group
t = -7.1776, df = 99, p-value = 1.322e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-11.185513 -6.340533
sample estimates:
mean of the differences
-8.763023
Is there a way to define explicitly which group is the BEFORE and which is the AFTER? Or, do you have to have the AFTER group first in Method 2.
Thanks for any help/explanation.
Here is the full R program that I used:
## sets working dir
# setwd("C:\\Temp\\")
## runs file from command line
# source("paired_ttest.r",echo=TRUE)
## Sets seed for repetitive number generation
set.seed(2820)
## Creates the matrices
preTest <- c(rnorm(100, mean = 145, sd = 9))
postTest <- c(rnorm(100, mean = 138, sd = 8))
ID <- c(1:100)
## Converts the matrices into a dataframe that looks like the way these
data are normally stored
pstt <- data.frame(ID,preTest,postTest)
## Puts the data in a form that can be used by R (grouping var | data var)
pstt2 <- data.frame(
group = rep(c("preTest","postTest"),each = 100),
weight = c(preTest, postTest)
)
print(pstt2)
## Runs paired-sample T-Test just on two original matrices
# t.test(preTest,postTest, paired = TRUE)
## Runs paired-sample T-Test on the newly structured data frame
t.test(weight ~ group, data = pstt2, paired = TRUE)

Since group is a factor, the t.test will use the first level of that factor as the reference level. By default factor levels are sorted alphabetically to "AFTER" would come before "BEFORE" and "postTest" would be come before "preTest". You can explicitly set reference level of a factor with relevel().
t.test(weight ~ relevel(group, "preTest"), data = pstt2, paired = TRUE)

Related

Distribution of mean*standard deviation of sample from gaussian

I'm trying to assess the feasibility of an instrumental variable in my project with a variable I havent seen before. The variable essentially is an interaction between the mean and standard deviation of a sample drawn from a gaussian, and im trying to see what this distribution might look like. Below is what im trying to do, any help is much appreciated.
Generate a set of 1000 individuals with a variable x following the gaussian distribution, draw 50 random samples of 5 individuals from this distribution with replacement, calculate the means and standard deviation of x for each sample, create an interaction variable named y which is calculated by multiplying the mean and standard deviation of x for each sample, plot the distribution of y.
Beginners version
There might be more efficient ways to code this, but this is easy to follow, I guess:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
# As Ben suggested, we create a data.frame filled with NA values
samples <- data.frame(mean = rep(NA, N), sd = rep(NA, N))
# Now we use a loop to populate the data.frame
for(i in 1:N){
# draw 5 samples from population (without replacement)
# I assume you want to replace for each turn of taking 5
# If you want to replace between drawing each of the 5,
# I think it should be obvious how to adapt the following code
smpl <- sample(stat_pop, size = 5, replace = FALSE)
# the data.frame currently has two columns. In each row i, we put mean and sd
samples[i, ] <- c(mean(smpl), sd(smpl))
}
# $ is used to get a certain column of the data.frame by the column name.
# Here, we create a new column y based on the existing two columns.
samples$y <- samples$mean * samples$sd
# plot a histogram
hist(samples$y)
Most functions here use positional arguments, i.e., you are not required to name every parameter. E.g., rnorm(1000, mean = 0, sd = 1) is the same as rnorm(1000, 0, 1) and even the same as rnorm(1000), since 0 and 1 are the default values.
Somewhat more efficient version
In R, loops are very inefficient and, thus, ought to be avoided. In case of your question, it does not make any noticeable difference. However, for large data sets, performance should be kept in mind. The following might be a bit harder to follow:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
n = 5
# again, I set replace = FALSE here; if you meant to replace each individual
# (so the same individual can be drawn more than once in each "draw 5"),
# set replace = TRUE
# replicate repeats the "draw 5" action N times
smpls <- replicate(N, sample(stat_pop, n, replace = FALSE))
# we transform the output and turn it into a data.frame to make it
# more convenient to work with
samples <- data.frame(t(smpls))
samples$mean <- rowMeans(samples)
samples$sd <- apply(samples[, c(1:n)], 1, sd)
samples$y <- samples$mean * samples$sd
hist(samples$y)
General note
Usually, you should do some research on the problem before posting here. Then, you either find out how it works by yourself, or you can provide an example of what you tried. To this end, you can simply google each of the steps you outlined (e.g., google "generate random standard distribution R" in order to find out about the function rnorm().
Run ?rnorm to get help on the function in RStudio.

2-sample independent t-test where each of two columns is in different data frame

I need to run a 2-sample independent t-test, comparing Column1 to Column2. But Column1 is in DataframeA, and Column2 is in DataframeB. How should I do this?
Just in case relevant (feel free to ignore): I am a true beginner. My experience with R so far has been limited to running 2-sample matched t-tests within the same data frame by doing the following:
t.test(response ~ Column1,
data = (Dataframe1 %>%
gather(key = "Column1", value = "response", "Column1", "Column2")),
paired = TRUE)
TL;DR
t_test_result = t.test(DataframeA$Column1, DataframeB$Column2, paired=TRUE)
Explanation
If the data is paired, I assume that both dataframes will have the same number of observations (same number of rows). You can check this with nrow(DataframeA) == nrow(DataframeB) .
You can think of each column of a dataframe as a vector (an ordered list of values). The way that you have used t.test is by using a formula (y~x), and you were essentially saying: Given the dataframe specified in data, perform a t test to assess the significance in the difference in means of the variable response between the paired groups in Column1.
Another way of thinking about this is by grabbing the data in data and separating it into two vectors: the vector with observations for the first group of Column1, and the one for the second group. Then, for each vector, you compute the mean and stdev and apply the appropriate formula that will give you the t statistic and hence the p value.
Thus, you can just extract those 2 vectors separately and provide them as arguments to the t.test() function. I hope it was beginner-friendly enough ^^ otherwise let me know
EDIT: a few additions
(I was going to reply in the comments but realized I did not have space hehe)
Regarding the what #Ashish did in order to turn it into a Welch's test, I'd say it was to set var.equal = FALSE. The paired parameter controls whether the t-test is run on paired samples or not, and since your data frames have unequal number of rows, I'm suspecting the observations are not matched.
As for the Cohen's d effect size, you can check this stats exchange question, from which I copy the code:
For context, m1 and m2 are the group's means (which you can get with n1 = mean(DataframeA$Column1)), s1 and s2 are the standard deviations (s2 = sd(DataframeB$Column2)) and n1 and n2 the sample sizes (n2 = length(DataframeB$Column2))
lx <- n1- 1 # Number of observations in group 1
ly <- n2- 1 # # Number of observations in group 1
md <- abs(m1-m2) ## mean difference (numerator)
csd <- lx * s1^2 + ly * s2^2
csd <- csd/(lx + ly)
csd <- sqrt(csd) ## common sd computation
cd <- md/csd ## cohen's d
This should work for you
res = t.test(DataFrameA$Column1, DataFrameB$Column2, alternative = "two.sided", var.equal = FALSE)

Change recursive vector to atomic vector for t-test

I'm new to R and am trying to run a t-test for two means. I keep getting the error is.atomic is not TRUE. I know I need to make my data atomic, but I haven't found a way online.
I've ran code to check that the data is recursive and did a as.data.frame(mydata).
titanic_summary <- data.frame(Outcome = c("Survived", "Died"),
Mean_Age = c(28.34369, 30.62618),
N = c(342, 549),
Total_Missing = c(52, 125))
titanic_summary
Run a stats test (two sample T-test)
str(titanic_summary)
as.data.frame(titanic_summary)
is.atomic(titanic_summary)
is.recursive(titanic_summary)
titanic_test <- titanic_summary %>%
t.test(Outcome~Mean_Age)
Error in var(x) : is.atomic(x) is not TRUE
t.test does not work the way you seem to think. To avoid that particular error, you could instead use something like titanic_test <- t.test(Mean_Age ~ Outcome, data = titanic_summary) but that would just give you different errors, which comes down to the real question:
You presumably want to see whether there may be a relationship between age and survival, i.e. whether the difference in average ages of 2.28249 is significant but you will need the individual ages or some other additional information about dispersion for this
If you do use the detailed dataset then I suspect that what you really want is something like this:
library(titanic)
titanic_test <- t.test(Age ~ Survived, data = titanic_train)
which would give (for the Kaggle selected training set used in the titanic package)
> titanic_test
Welch Two Sample t-test
data: Age by Survived
t = 2.046, df = 598.84, p-value = 0.04119
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.09158472 4.47339446
sample estimates:
mean in group 0 mean in group 1
30.62618 28.34369

R: iterate a function over two lists simultaneously using lapply?

I have multiple factors dividing my data.
By one factor (uniqueGroup), I would like to subset my data, by another factor (distance), I want to first classify my data by "moving threshold", and then test statistical difference between groups.
I have created a function movThreshold to classify my data, and test it by wilcox.test. To vary the different threshold values, I just run
lapply(th.list, # list of thresholds
movThreshold, # my function
tab = tab, # original data
dependent = "infGrad") # dependent variable
Now I've realized, that in fact I need to firstly subset my data by uniqueGroup, and then vary the threshold value. But I am not sure, how to write it in my lapply code?
My dummy data:
set.seed(10)
infGrad <- c(rnorm(20, mean=14, sd=8),
rnorm(20, mean=13, sd=5),
rnorm(20, mean=8, sd=2),
rnorm(20, mean=7, sd=1))
distance <- rep(c(1:4), each = 20)
uniqueGroup <- rep(c("x", "y"), 40)
tab<-data.frame(infGrad, distance, uniqueGroup)
# Create moving threshold function &
# test for original data
# ============================================
movThreshold <- function(th, tab, dependent, ...) {
# Classify data
tab$group<- ifelse(tab$distance < th, "a", "b")
# Calculate wincoxon test - as I have only two groups
test<-wilcox.test(tab[[dependent]] ~ as.factor(group), # specify column name
data = tab)
# Put results in a vector
c(th, unique(tab$uniqueGroup), dependent, uniqueGroup, round(test$p.value, 3))
}
# Define two vectors to run through
# unique group
gr.list<-unique(tab$uniqueGroup)
# unique threshold
th.list<-c(2,3,4)
How to run lapply over two lists??
lapply(c(th.list,gr.list), # iterate over two vectors, DOES not work!!
movThreshold,
tab = tab,
dependent = "infGrad")
In my previous question (Kruskal-Wallis test: create lapply function to subset data.frame?), I've learnt how to iterate through individual subsets within a table:
lapply(split(tab, df$uniqueGroup), movThreshold})
But how to iterate through subsets, and through thresholds at once?
If I understood correctly what you're trying to do, here is a data.table solution:
library(data.table)
setDT(tab)[, lapply(th.list, movThreshold, tab = tab, dependent = "infGrad"), by = uniqueGroup]
Also, you can just do a nested lapply.
lapply(gr.list, function(z) lapply(th.list, movThreshold, tab = tab[uniqueGroup == z, ], dependent = "infGrad"))
I apologize, If I misunderstood what you're trying to do.

T-test with grouping variable

I've got a data frame with 36 variables and 74 observations. I'd like to make a two sample paired ttest of 35 variables by 1 grouping variable (with two levels).
For example: the data frame contains "age" "body weight" and "group" variables.
Now I suppose I can do the ttest for each variable with this code:
t.test(age~group)
But, is there a way to test all the 35 variables with one code, and not one by one?
Sven has provided you with a great way of implementing what you wanted to have implemented. I, however, want to warn you about the statistical aspect of what you are doing.
Recall that if you are using the standard confidence level of 0.05, this means that for each t-test performed, you have a 5% chance of committing Type 1 error (incorrectly rejecting the null hypothesis.) By the laws of probability, running 35 individual t-tests compounds your probability of committing type 1 error by a factor of 35, or more exactly:
Pr(Type 1 Error) = 1 - (0.95)^35 = 0.834
Meaning that you have about an 83.4% chance of falsely rejecting a null hypothesis. Basically what this means is that, by running so many T-tests, there is a very high probability that at least one of your T-tests is going to provide an incorrect result.
Just FYI.
An example data frame:
dat <- data.frame(age = rnorm(10, 30), body = rnorm(10, 30),
weight = rnorm(10, 30), group = gl(2,5))
You can use lapply:
lapply(dat[1:3], function(x)
t.test(x ~ dat$group, paired = TRUE, na.action = na.pass))
In the command above, 1:3 represents the numbers of the columns including the variables. The argument paired = TRUE is necessary to perform a paired t-test.

Resources