How can I perform a pairwise t.test in R across multiple independent vectors?

TL;DR edition
I have vectors X1, X2, X3, ..., Xn. I want to test whether the average value of any one vector is significantly different from the average value of any other vector, for every possible pair of vectors. I am seeking a better way to do this in R than running n^2 individual t.tests.
Full Story
I have a data frame full of census data for a particular CSA. Each row contains observations for each variable (column) for a particular census tract.
What I need to do is compare means for the same variable across census tracts in different MSAs. In other words, I want to factor my data.frame according to the MSA designation variable (which is one of the columns) and then compare the differences in the means for another variable of interest pairwise across each newly-factored MSA. This is essentially doing pairwise t.tests across each ensuing vector, but I wish to do this in a more elegant way than writing t.test(MSAx, MSAy) over and over again. How can I do this?

The advantage of my method below over the one proposed by @ashkan is that it removes duplicates (i.e. either X1 vs. X2 or X2 vs. X1 will appear in the results, not both).
# Generate dummy data
df <- data.frame(matrix(rnorm(100), ncol = 10))
colnames(df) <- paste0("X", 1:10)

# Create all unique pairs of variables
combinations <- combn(colnames(df), 2, simplify = FALSE)

# Run a t.test for each pair
results <- lapply(combinations, function(pair) {
  t.test(df[[pair[1]]], df[[pair[2]]])
})

# Rename list for legibility
names(results) <- sapply(combinations, paste, collapse = " vs. ")
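If you then only need the numeric p-values rather than the full htest objects, one way to pull them out (a small sketch reusing the results list built above) is:

p_values <- sapply(results, function(r) r$p.value)
round(p_values, 4)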

Just use pairwise.t.test, here is an example:
x1 <- rnorm(50)
x2 <- rnorm(30, mean = 0.2)
x3 <- rnorm(100, mean = 0.1)
x4 <- rnorm(100, mean = 0.4)

x <- data.frame(data = c(x1, x2, x3, x4),
                key  = c(rep("x1", length(x1)),
                         rep("x2", length(x2)),
                         rep("x3", length(x3)),
                         rep("x4", length(x4))))

pairwise.t.test(x$data, x$key, pool.sd = FALSE)
# Pairwise comparisons using t tests with non-pooled SD
#
# data: x$data and x$key
#
# x1 x2 x3
# x2 0.7395 - -
# x3 0.9633 0.9633 -
# x4 0.0067 0.9633 0.0121
#
# P value adjustment method: holm
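Note that pairwise.t.test corrects for multiple comparisons with Holm's method by default, as the output above shows. If you want the raw p-values, or a different correction, pass p.adjust.method explicitly, e.g.:

pairwise.t.test(x$data, x$key, pool.sd = FALSE, p.adjust.method = "none")
pairwise.t.test(x$data, x$key, pool.sd = FALSE, p.adjust.method = "bonferroni")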

If you have a data.frame and you wish to independently perform T-tests between each column of the data.frame, you can use a double apply loop:
apply(MSA, 2, function(x1) {
  apply(MSA, 2, function(x2) {
    t.test(x1, x2)
  })
})
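If you only want the p-values, a nested sapply gives a tidy matrix instead of a list of lists (a sketch, assuming MSA is a data frame of numeric columns; the diagonal compares each column with itself, so it is 1):

# matrix of pairwise Welch t-test p-values, rows/columns named after MSA's columns
p_mat <- sapply(MSA, function(x1) sapply(MSA, function(x2) t.test(x1, x2)$p.value))
round(p_mat, 4)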
A good visualization to accompany such a brute force approach would be a forest plot:
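To make the snippet below self-contained, assume MSA is a numeric data frame, for example:

# hypothetical stand-in for the real census data
MSA <- data.frame(matrix(rnorm(300, mean = 5), ncol = 3))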
# 95% CI of each column mean: mean +/- 1.96 * standard error
cis <- apply(MSA, 2, function(x) mean(x) + c(-1, 1) * 1.96 * sd(x) / sqrt(length(x)))
plot.new()
plot.window(xlim = c(1, ncol(cis)), ylim = range(cis))
segments(1:ncol(cis), cis[1, ], 1:ncol(cis), cis[2, ])
axis(1, at = 1:ncol(cis), labels = colnames(MSA))
axis(2)
box()
abline(h = mean(unlist(MSA)), lty = 'dashed')  # mean(<data.frame>) errors in modern R
title('Forest plot of 95% confidence intervals of MSA')

In addition to the response from quarzgar, there is another method to perform pairwise t-tests across multiple factors in R. Basically, the trick is to create a combination of the levels of the two (or more) factors and test on that.
Example with a 2x2 classical design:
df <- data.frame(Id = c(rep(1:100, 2), rep(101:200, 2)),
                 dv = c(rnorm(100, 10, 5), rnorm(100, 20, 7),
                        rnorm(100, 11, 5), rnorm(100, 12, 6)),
                 Group = c(rep("Experimental", 200), rep("Control", 200)),
                 Condition = rep(c(rep("Pre", 100), rep("Post", 100)), 2))

# ANOVA
summary(aov(dv ~ Group * Condition + Error(Id / Condition), data = df))

# post-hoc across all factors
df$posthoclevels <- paste(df$Group, df$Condition)  # factor combination
pairwise.t.test(df$dv, df$posthoclevels)
# Pairwise comparisons using t tests with pooled SD
#
# data: df$dv and df$posthoclevels
#
# Control Post Control Pre Experimental Post
# Control Pre 0.60 - -
# Experimental Post <2e-16 <2e-16 -
# Experimental Pre 0.26 0.47 <2e-16
#
# P value adjustment method: holm
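A side note on the factor trick: interaction() builds the combined factor directly and keeps proper factor levels, avoiding paste()'s character conversion (the level separator is "." by default; see its sep argument):

df$posthoclevels <- interaction(df$Group, df$Condition)
pairwise.t.test(df$dv, df$posthoclevels)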

Related

automatization of lm tests with all possible var combinations and getting values for: shapiro.test(), bptest(),vif() in R

I've spent days searching for the optimal models that would fulfill all of the standard OLS assumptions (normal distribution of residuals, homoscedasticity, no multicollinearity) in R, but with 12 variables it's impossible to find the optimal variable combination by hand. So I was trying to create a script that would automate this process.
Here is the sample code for the calculations:
x1 <- runif(100, 0, 10)
x2 <- runif(100, 0, 10)
x3 <- runif(100, 0, 10)
x4 <- runif(100, 0, 10)
x5 <- runif(100, 0, 10)
df <- as.data.frame(cbind(x1,x2,x3,x4,x5))
library(lmtest)
library(car)
model <- lm(x1~x2+x3+x4+x5, data = df)
# check for normal distribution (Shapiro-Wilk-Test)
rs_sd <- rstandard(model)
shapiro.test(rs_sd)
# check for heteroskedasticity (Breusch-Pagan-Test)
bptest(model)
# check for multicollinearity
vif(model)
#-------------------------------------------------------------------------------
# models without outliers
# identify outliers (calculate the Cook's distance; if D > 4/(n-k-1) --> outlier)
cooks <- round(cooks.distance(model), digits = 4)
df_no_out <- cbind(df, cooks)
df_no_out <- subset(df_no_out, cooks < 4/(100-4-1))
model_no_out <- lm(x1~x2+x3+x4+x5, data = df_no_out)
# check for normal distribution
rs_sd_no_out<- rstandard(model_no_out)
shapiro.test(rs_sd_no_out)
# check for heteroskedasticity
bptest(model_no_out)
# check for multicollinearity
vif(model_no_out)
What I have in mind is to loop through all of the variable combinations and get the p-values from shapiro.test() and bptest(), plus the VIF values, for every model created, so I can compare the significance values and the multicollinearity respectively. (In my dataset multicollinearity shouldn't be a problem anyway, and since vif() produces one value per variable, which is harder to collect in code, the p-values from shapiro.test() and bptest() would suffice.)
I've tried to write several scripts that would automate the process, but without success (unfortunately I'm not a programmer).
I know there are already some threads dealing with this problem:
How to run lm models using all possible combinations of several variables and a factor
Finding the best combination of variables for high R-squared values
but I haven't found a script that would also calculate JUST the P-VALUES.
Especially the tests for the models without outliers are important, because after removing the outliers the OLS assumptions are fulfilled in many cases.
I would really appreciate any suggestions or help with this.
You are scratching the surface of what is now referred to as statistical learning. The intro text is "An Introduction to Statistical Learning: with Applications in R" and the grad-level text is "The Elements of Statistical Learning".
To do what you need, use the regsubsets() function from the "leaps" package. However, if you read at least chapter 6 of the intro book, you will learn about cross-validation and bootstrapping, which are the modern ways of doing model selection.
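A minimal regsubsets() sketch, run against the df defined in the question (x1 as response, exhaustive search over the other four regressors):

library(leaps)
rs <- regsubsets(x1 ~ ., data = df, nvmax = 4)
summary(rs)$which   # which regressors enter the best model of each size
summary(rs)$adjr2   # adjusted R-squared of each of those models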
The following automates the model fitting and the tests you ran afterwards.
There is one function that fits all possible models. Then a series of calls to the *apply functions will get the values you want.
library(lmtest)
library(car)

fitAllModels <- function(data, resp, regr){
  f <- function(M){
    apply(M, 2, function(x){
      fmla <- paste(resp, paste(x, collapse = "+"), sep = "~")
      fmla <- as.formula(fmla)
      lm(fmla, data = data)
    })
  }
  regr <- names(data)[names(data) %in% regr]
  regr_list <- lapply(seq_along(regr), function(n) combn(regr, n))
  models_list <- lapply(regr_list, f)
  unlist(models_list, recursive = FALSE)
}
Now the data.
# Make up a data.frame to test the function above.
# Don't forget to set the RNG seed to make the
# results reproducible
set.seed(7646)
x1 <- runif(100, 0, 10)
x2 <- runif(100, 0, 10)
x3 <- runif(100, 0, 10)
x4 <- runif(100, 0, 10)
x5 <- runif(100, 0, 10)
df <- data.frame(x1, x2, x3, x4, x5)
First fit all models with "x1" as response and the other variables as possible regressors. The function can be called with one response and any number of possible regressors you want.
fit_list <- fitAllModels(df, "x1", names(df)[-1])
And now the sequence of tests.
# Normality test, standardized residuals
rs_sd_list <- lapply(fit_list, rstandard)
sw_list <- lapply(rs_sd_list, shapiro.test)
sw_pvalues <- sapply(sw_list, '[[', 'p.value')
# check for heteroskedasticity (Breusch-Pagan-Test)
bp_list <- lapply(fit_list, bptest)
bp_pvalues <- sapply(bp_list, '[[', 'p.value')
# check for multicollinearity,
# only models with 2 or more regressors
vif_values <- lapply(fit_list, function(fit){
  regr <- attr(terms(fit), "term.labels")
  if(length(regr) < 2) NA else vif(fit)
})
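All three result vectors line up with fit_list, so they can be collected into one overview table (a sketch; the formula column is only there for labeling):

overview <- data.frame(
  formula   = sapply(fit_list, function(fit) deparse(formula(fit))),
  shapiro_p = sw_pvalues,
  bp_p      = bp_pvalues
)
head(overview)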
A note on the Cook's distance. In your code, you are subsetting the original data.frame, producing a new one without outliers. This will duplicate data. I have opted for a list of indices of the df's rows only. If you prefer the duplicated data.frames, uncomment the line in the anonymous function below and comment out the last one.
# models without outliers
# identify outliers (calculating the
# Cooks distance, if x > 4/(n - k - 1) --> outlier
df_no_out_list <- lapply(fit_list, function(fit){
  cooks <- cooks.distance(fit)
  regr <- attr(terms(fit), "term.labels")
  k <- length(regr)
  inx <- cooks < 4/(nrow(df) - k - 1)
  #df[inx, ]
  which(inx)
})
# This tells how many rows have the df's without outliers
sapply(df_no_out_list, NROW)
# A data.frame without outliers. This one is the one
# for model number 8.
# The two code lines could become a one-liner.
i <- df_no_out_list[[8]]
df[i, ]
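If you then want to rerun the whole test battery on the outlier-free data, you can refit every model on its own index set (a sketch using update(), which re-evaluates each lm call with the new data):

fit_no_out_list <- lapply(seq_along(fit_list), function(i)
  update(fit_list[[i]], data = df[df_no_out_list[[i]], ]))

The *apply calls above can then be repeated on fit_no_out_list.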

Manually implementing Regression Likelihood Ratio Test

I'm trying to implement my own linear regression likelihood ratio test.
The test takes the residual sum of squares of a reduced model and of the full model, forms an F statistic from them, and compares it to the critical F value.
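(For reference, the standard form of that statistic is F = [(SSRes_reduced - SSRes_full) / (p - q)] / [SSRes_full / df_full], where p and q count the parameters of the full and reduced models and df_full is the residual degrees of freedom of the full model; under H0 it follows an F(p - q, df_full) distribution.)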
However, I am having some trouble implementing the function, especially when dealing with dummy variables.
This is the dataset I am working with and testing the function on.
Here is the code so far:
The function inputs are the design matrix mat, the response matrix (which has just one column), the indices of the variables being tested, and the significance level alpha for the test.
linear_regression_likelihood <- function(mat, response, indices, alpha) {
  mat <- as.matrix(mat)
  reduced <- mat[, c(1, indices)]

  q <- 1  # set q = 1 just to test on data
  p <- dim(mat)[2]
  n <- dim(mat)[1]

  f_stat <- qf(1 - alpha, df1 = p - q, df2 = n - (p + 1))

  beta_hat_full <- qr.solve(t(mat) %*% mat) %*% t(mat) %*% response
  y_hat_full <- mat %*% beta_hat_full
  SSRes_full <- t(response - y_hat_full) %*% (response - y_hat_full)

  beta_hat_red <- qr.solve(t(reduced) %*% reduced) %*% t(reduced) %*% response
  y_hat_red <- reduced %*% beta_hat_red
  SSRes_red <- t(response - y_hat_red) %*% (response - y_hat_red)

  s_2 <- (t(response - mat %*% beta_hat_full) %*% (response - mat %*% beta_hat_full)) / (n - p + 1)
  critical_value <- ((SSRes_red - SSRes_full) / (p - q)) / s_2
  print(critical_value)

  if (critical_value > f_stat) {
    return("Reject H0")
  } else {
    return("Fail to Reject H0")
  }
}
Here is the setup code, where I set up the matrix in the correct format. data is the read-in CSV file.
data <- data[, 2:5]
mat <- data[, 2:4]
response <- data[, 1]

library(ade4)
df <- data.frame(mat$x3)
dummy <- acm.disjonctif(df)
dummy

mat <- cbind(1, mat[1:2], dummy)
linear_regression_likelihood(mat, response, 2:3, 0.05)
This is the error I keep getting.
Error in solve.default(as.matrix(c)) : system is computationally singular: reciprocal condition number = 1.63035e-18
I know it has to do with taking the inverse of the matrix product, but the function is unable to do so. I thought it might be due to the dummy variables having values that are too small, but I am not sure of any other way to include them.
The test I am doing checks whether the factor variable x3 has any effect on the response y. The actual answer, which I verified using the built-in functions, is that we fail to reject the null hypothesis.
The error originates from line
beta_hat_full <- qr.solve(t(mat)%*%mat)%*%t(mat)%*%response
If you go through your function step-by-step you will see an error
Error in qr.solve(t(mat) %*% mat) : singular matrix 'a' in solve
The problem here is that your model matrix does not have full column rank, which translates to your regression coefficients not being unique. This is a result of the way you "dummyfied" x3. In order to ensure full rank, you need to remove one dummy column (or manually remove the intercept).
In the following example I remove the A column from dummy, which means that the resulting x3 coefficients measure the effect of levels B, C, and D relative to the baseline A.
# Read data
data <- read.csv("data_hw5.csv")
data <- data[, 2:5]

# Extract predictor and response data
mat <- data[, 2:4]
response <- data[, 1]

# Dummify categorical predictor x3
library(ade4)
df <- data.frame(mat$x3)
dummy <- acm.disjonctif(df)
dummy <- dummy[, -1]  # Remove A to have A as baseline
mat <- cbind(1, mat[1:2], dummy)

# Apply linear_regression_likelihood
linear_regression_likelihood(mat, response, 2:3, 0.05)
# [,1]
#[1,] 8.291975
#[1] "Reject H0"
A note
The error could have been avoided by using base R's model.matrix, which ensures full rank when "dummyfying" categorical variables (model.matrix is also called implicitly by lm and glm to deal with categorical, i.e. factor, variables).
Take a look at
mm <- model.matrix(y ~ x1 + x2 + x3, data = data)
which by default omits the first level of factor variable x3. mm is identical to mat after (correct) "dummification".
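A quick sanity check that the model matrix really has full column rank (so the normal equations are solvable):

qr(mm)$rank == ncol(mm)  # TRUE for a full-rank model matrix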

Generate random values in R with a defined correlation in a defined range

For a science project, I am looking for a way to generate random data in a certain range (e.g. min=0, max=100000) with a certain correlation with another variable which already exists in R. The goal is to enrich the dataset a little so I can produce some more meaningful graphs (no worries, I am working with fictional data).
For example, I want to generate random values correlating with r=-.78 with the following data:
var1 <- rnorm(100, 50, 10)
I already came across some pretty good solutions (e.g. https://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable), but I only get very small values, which I cannot transform so that they make sense in the context of the other, original values.
Following the example:
var1 <- rnorm(100, 50, 10)
n <- length(var1)
rho <- -0.78
theta <- acos(rho)
x1 <- var1
x2 <- rnorm(n, 50, 50)
X <- cbind(x1, x2)
Xctr <- scale(X, center=TRUE, scale=FALSE)
Id <- diag(n)
Q <- qr.Q(qr(Xctr[ , 1, drop=FALSE]))
P <- tcrossprod(Q) # = Q Q'
x2o <- (Id-P) %*% Xctr[ , 2]
Xc2 <- cbind(Xctr[ , 1], x2o)
Y <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2)))
var2 <- Y[ , 2] + (1 / tan(theta)) * Y[ , 1]
cor(var1, var2)
What I get for var2 are values ranging between -0.5 and 0.5, with a mean of 0. I would like the data to be much more spread out, so that I could simply transform it (e.g. by adding 50) and get a range similar to that of my first variable.
Does any of you know a way to generate this kind of (more or less) meaningful data?
Thanks a lot in advance!
Starting with var1, renamed to A, and using 10,000 points:
set.seed(1)
A <- rnorm(10000,50,10) # Mean of 50
First convert the values in A to have the new desired mean of 50,000 and an inverse relationship (i.e. subtract):
B <- 1e5 - (A*1e3) # Note that { mean(A) * 1000 = 50,000 }
This alone results in r = -1. Add some noise to achieve the desired r:
B <- B + rnorm(10000,0,8.15e3) # Note this noise has mean = 0
# the amount of noise, 8.15e3, was found through parameter-search
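As an aside, the noise level can also be derived instead of searched: for B = constant - 1000*A + noise, |r| = sd_signal / sqrt(sd_signal^2 + sd_noise^2), so sd_noise = sd_signal * sqrt(1/r^2 - 1). Here sd_signal = 1000 * 10 = 10,000, which gives roughly the same number:

1e4 * sqrt(1/0.78^2 - 1)  # ~8023, close to the 8.15e3 found by search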
This has your desired correlation:
cor(A,B)
[1] -0.7805972
View with:
plot(A,B)
Caution
Your B values might fall outside your range of 0 to 100,000. You might need to filter out values outside the range if you use a different seed or generate more numbers.
That said, the current range is fine:
range(B)
[1] 1668.733 95604.457
If you're happy with the correlation and the marginal distribution (i.e. shape) of the generated values, multiply the values (which fall between -0.5 and +0.5) by 100,000 and add 50,000.
> c(-0.5, 0.5) * 100000 + 50000
[1] 0e+00 1e+05
Edit: this approach, or anything else where 100,000 and 50,000 are exchanged for different numbers, is an example of the 'linear transformation' recommended by @gregor-de-cillia.
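A quick check that the correlation is unchanged by such a shift and scale (reusing var1 and var2 from the question's snippet):

var2_scaled <- 50000 + 100000 * var2
cor(var1, var2_scaled)  # same as cor(var1, var2); linear maps preserve r
range(var2_scaled)      # now roughly 0 to 100,000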

R: Testing each level of a factor without creating new variables

Suppose I have a data frame with a binary grouping variable and a factor. An example of such a grouping variable could specify assignment to the treatment and control conditions of an experiment. In the below, b is the grouping variable while a is an arbitrary factor variable:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
I want to complete two-sample t-tests to assess the below:
For each level of a, whether there is a difference in the mean propensity to adopt that level between the groups specified in b.
I have used the dummies package to create separate dummies for each level of the factor and then manually performed t-tests on the resulting variables:
library(dummies)
new <- dummy.data.frame(df, names = "a")
t.test(new$aa, new$b)
t.test(new$ab, new$b)
I am looking for help with the following:
Is there a way to perform this without creating a large number of dummy variables via dummy.data.frame()?
If there is not a quicker way to do it without creating a large number of dummies, is there a quicker way to complete the t-test across multiple columns?
Note
This is similar to but different from R - How to perform the same operation on multiple variables and nearly the same as this question Apply t-test on many columns in a dataframe split by factor but the solution of that question no longer works.
Here is a base R solution implementing a chi-squared test for equality of proportions, which I believe is more likely to answer the question you're actually asking of your data (see my comment above):
set.seed(1)
## generate similar but larger/more complex toy dataset
a <- sample(letters[1:4], 100, replace = T)
b <- sample(0:1, 10, replace = T)
head((df <- data.frame(a,b)))
  a b
1 b 1
2 b 0
3 c 0
4 d 1
5 a 1
6 d 0
## create a set of contingency tables for proportions
## of each level of df$a to the others
cTbls <- lapply(unique(a), function(x) table(df$a==x, df$b))
## apply chi-squared test to each contingency table
results <- lapply(cTbls, prop.test, correct = FALSE)
## preserve names
names(results) <- unique(a)
## only one result displayed for sake of space:
results$b
2-sample test for equality of proportions without continuity
correction
data: X[[i]]
X-squared = 0.18382, df = 1, p-value = 0.6681
alternative hypothesis: two.sided
95 percent confidence interval:
-0.2557295 0.1638177
sample estimates:
prop 1 prop 2
0.4852941 0.5312500
Be aware, however, that you might not want to interpret your p-values without correcting for multiple comparisons. A quick simulation demonstrates that the chance of incorrectly rejecting the null hypothesis with at least one of your tests can be dramatically higher than 5%(!):
set.seed(11)
sum(
  replicate(1e4, {
    a <- sample(letters[1:4], 100, replace = T)
    b <- sample(0:1, 100, replace = T)
    df <- data.frame(a, b)
    cTbls <- lapply(unique(a), function(x) table(df$a == x, df$b))
    results <- lapply(cTbls, prop.test, correct = FALSE)
    any(sapply(results, function(x) x$p.value < .05))
  })
) / 1e4
[1] 0.1642
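One standard remedy is to Holm-adjust the per-level p-values before interpreting them (a sketch reusing the results list from the worked example above):

p_raw <- sapply(results, function(x) x$p.value)
p.adjust(p_raw, method = "holm")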
I don't exactly understand what this is doing from a statistical standpoint, but this code generates a list where each element is the output of the t.test() you ran above:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
library(dplyr)
library(tidyr)
dfNew <- df %>% group_by(a) %>% summarise(count = n()) %>% spread(a, count)

lapply(1:ncol(dfNew), function(x)
  t.test(c(rep(1, dfNew[1, x]), rep(0, length(b) - dfNew[1, x])), b))
This will save you the typing of t.test(foo, bar) continuously, and also eliminates the need for dummy variables.
Edit: I don't think the above method preserves the order of the columns, only the frequency of values measured as 0 or 1. If the order is important (again, I don't know the goal of this procedure), you can use the dummy method and lapply through the data.frame you named new.
library(dummies)
new <- dummy.data.frame(df, names = "a")
lapply(1:(ncol(new) - 1), function(x)
  t.test(new[, x], new[, ncol(new)]))

Permutations of correlation coefficients

My question is on the permutation of correlation coefficients.
A<-data.frame(A1=c(1,2,3,4,5),B1=c(6,7,8,9,10),C1=c(11,12,13,14,15 ))
B<-data.frame(A2=c(6,7,7,10,11),B2=c(2,1,3,8,11),C2=c(1,5,16,7,8))
cor(A,B)
# A2 B2 C2
# A1 0.9481224 0.9190183 0.459588
# B1 0.9481224 0.9190183 0.459588
# C1 0.9481224 0.9190183 0.459588
I obtained this correlation and then wanted to perform permutation tests to check if the correlation still holds.
I did the permutation as follows:
A<-as.vector(t(A))
B<-as.vector(t(B))
corperm <- function(A,B,1000) {
  # n is the number of permutations
  # A and B are the vectors to correlate
  obs = abs(cor(A,B))
  tmp = sapply(1:n, function(z) {abs(cor(sample(A, replace=TRUE), B))})
  return(1 - sum(obs > tmp)/n)
}
The result was
[1] 0.645
and using "cor.test"
cor.test(A,B)
Pearson's product-moment correlation
data: A and B
t = 0.4753, df = 13, p-value = 0.6425
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.4089539 0.6026075
sample estimates:
cor
0.1306868
How could I draw a plot or a histogram showing the actual correlation and the permuted correlation values from the permuted data?
First of all, you can't have done it exactly this way, as ...
> corperm = function(A,B,1000) {
Error: unexpected numeric constant in "corperm = function(A,B,1000"
The third argument has no name but it should have one! Perhaps you meant
> corperm <- function(A, B, n=1000) {
# etc
Then you need to think about what you want to achieve. Initially you have two data sets with 3 variables each; then you collapse them into two vectors and compute a correlation between the permuted vectors. Why does that make sense? The structure of the permuted data set should be the same as that of the original data set.
obs = abs(cor(A,B))
tmp = sapply(1:n,function(z) {abs(cor(sample(A,replace=TRUE),B))})
return(1-sum(obs>tmp)/n)
Why do you use replace=TRUE here? This would make sense if you wanted bootstrap CIs, but (a) it would be better to use a dedicated function then, e.g. boot from the package boot, and (b) you would need to do the same with B, i.e. sample(B, replace=TRUE).
For a permutation test you sample without replacement, and then it makes no difference whether you permute both A and B or only A.
And how to get the histogram? Well, hist(tmp) will draw a histogram of the permuted values, and obs is the absolute value of the observed correlation.
HTHAB
(edit)
corperm <- function(x, y, N=1000, plot=FALSE){
  reps <- replicate(N, cor(sample(x), y))
  obs <- cor(x, y)
  p <- mean(reps > obs)  # shortcut for sum(reps > obs)/N
  if(plot){
    hist(reps)
    abline(v=obs, col="red")
  }
  p
}
Now you can use this on a single pair of variables:
corperm(A[,1], B[,1])
To apply it to all pairs, use for or mapply. for is easier to understand, so I won't insist on using mapply to get all possible pairs.
res <- matrix(NA, nrow = NCOL(A), ncol = NCOL(B))
for(iii in 1:3) for(jjj in 1:3) res[iii, jjj] <- corperm(A[, iii], B[, jjj], plot = FALSE)
rownames(res) <- names(A)
colnames(res) <- names(B)
print(res)
To make all histograms, use plot=TRUE above.
I think there is not much point in doing a permutation test for the correlation of two variables, because the cor.test() function already provides a p-value, which has the same effect as the permutation test.
