Remove NAs when using mapply for t-test in R

I would like to do a column-wise t-test between two data frames in R, that is, t.test(df1$col1, df2$col1), t.test(df1$col2, df2$col2), and so on. The best option here seems to be mapply or Map. Something like:
mapply(t.test,tnav_DJF_histo.csv[,-1],tnav_DJF.csv[,-1])
works perfectly, but if one of the columns contains NAs, it fails with this error:
Error in t.test.default(dots[[1L]][[1L]], dots[[2L]][[1L]]) :
not enough 'y' observations
Question: how can I use na.rm to get the job done? For example, if a column in tnav_DJF.csv[,-1] has NAs but there are no NAs in tnav_DJF_histo.csv[,-1], how can I tell mapply to ignore or skip the analysis for these columns?
Many thanks.
aez.

You can do this with mapply and an anonymous function as follows:
Example data:
df1 <- data.frame(a=runif(20), b=runif(20), c=rep(NA,20))
df2 <- data.frame(a=runif(20), b=runif(20), c=c(NA,1:18,NA))
#notice df1's third column is just NAs
Solution:
Use mapply with an anonymous function as follows:
# anonymous function testing for NAs
mapply(function(x, y) {
  if(all(is.na(x)) || all(is.na(y))) NULL else t.test(x, y, na.action=na.omit)
}, df1, df2)
Output:
$a
Welch Two Sample t-test
data: x and y
t = 1.4757, df = 37.337, p-value = 0.1484
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.0543192 0.3458648
sample estimates:
mean of x mean of y
0.5217619 0.3759890
$b
Welch Two Sample t-test
data: x and y
t = 1.1689, df = 37.7, p-value = 0.2498
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.0815067 0.3041051
sample estimates:
mean of x mean of y
0.5846343 0.4733351
$c
NULL
P.S. There is no na.rm argument in t.test. There is only an na.action argument, but even if you set that to na.omit (which I have), you will still get an error if all of a column's elements are NA.
P.S.2 If only some of the elements of either x or y are NA, t.test will run fine and simply omit those elements. If you want to skip the t-test whenever either column contains even a single NA, change the all in the above function to any.
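For example, a minimal variant that skips a pair of columns as soon as either side contains any NA (same df1 and df2 as above):
# skip the pair if either column contains at least one NA
mapply(function(x, y) {
  if(any(is.na(x)) || any(is.na(y))) NULL else t.test(x, y)
}, df1, df2)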

Can you do something like
t.test2 <- function(col1, col2){
  ok <- complete.cases(col1, col2)  # rows where both columns are observed
  if(sum(ok) < 3){return(NA)}       # too few complete pairs to run a test
  t.test(col1[ok], col2[ok])        # note: t.test has no na.rm argument
}
mapply(t.test2, csv1[, -1], csv2[, -1])
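If you only need the p-values, one possible follow-up (a sketch; csv1 and csv2 stand for the two data frames, as in the call above):
res <- mapply(t.test2, csv1[, -1], csv2[, -1], SIMPLIFY = FALSE)
# p-value where a test was run, NA where the pair was skipped
sapply(res, function(tt) if (is.list(tt)) tt$p.value else NA_real_)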

Related

Removing outliers: cannot run cor.test()

I am extracting outliers from a single column of a dataset, then attempting to run cor.test() on that column plus another column. I am getting this error:
Error in cor.test.default(dep_delay_noout, distance) : 'x' and 'y' must have the same length
I assume this is because removing the outliers from one column made it a different length than the other column, but I am not sure what to do about it. I have tried mutating the dataset by adding a new column that lacked outliers, but unfortunately ran into the same problem. Does anybody know what to do? Below is my code.
dep_delay<-flights$dep_delay
dep_delay_upper<-quantile(dep_delay,0.997,na.rm=TRUE)
dep_delay_lower<-quantile(dep_delay,0.003,na.rm=TRUE)
dep_delay_out<-which(dep_delay>dep_delay_upper|dep_delay<dep_delay_lower)
dep_delay_noout<-dep_delay[-dep_delay_out]
distance<-flights$distance
cor.test(dep_delay_noout,distance)
You were almost there: in cor.test you also want to subset distance. Additionally, for the preprocessing you could use a quantile vector of length 2 and mapply to do both comparisons in one step, just to write it more concisely; your original code is otherwise fine.
data('flights', package='nycflights13')
nna <- !is.na(flights$dep_delay)
(q <- quantile(flights$dep_delay[nna], c(0.003, 0.997)))
# 0.3% 99.7%
# -14 270
# TRUE where dep_delay lies strictly between the 0.3% and 99.7% quantiles, i.e. not an outlier
no_out <- rowSums(mapply(\(f, q) f(flights$dep_delay[nna], q), c(`>`, `<`), q)) == 2
with(flights[nna, ], cor.test(dep_delay[no_out], distance[no_out]))
# Pearson's product-moment correlation
#
# data: dep_delay[no_out] and distance[no_out]
# t = -12.409, df = 326171, p-value < 2.2e-16
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
# -0.02515247 -0.01829207
# sample estimates:
# cor
# -0.02172252
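If you prefer to keep your original variables, the minimal fix is simply to drop the same positions from distance as from dep_delay, so the two vectors stay the same length:
# using dep_delay, distance and dep_delay_out exactly as defined in the question
cor.test(dep_delay[-dep_delay_out], distance[-dep_delay_out])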

Using Rollapply to return both the Coefficient and RSquare

I have a dataset that looks something like this:
library(data.table)
dtset <- data.table(x = 11:30, y = rnorm(20))
I would like to calculate the rolling regression coefficient and R squared over the last 10 items:
library(zoo)
dtset[, coefficient := rollapply(1:20, width=10, FUN=function(a) {
  subdtset <- dtset[a]
  reg <- lm.fit(matrix(data=c(subdtset$x, rep(1, nrow(subdtset))), nrow=nrow(subdtset), ncol=2), subdtset$y)
  return(coef(reg)[1])
}, align="right", fill=NA)]
dtset[, rsquare := rollapply(1:20, width=10, FUN=function(a) {
  subdtset <- dtset[a]
  reg <- lm.fit(matrix(data=c(subdtset$x, rep(1, nrow(subdtset))), nrow=nrow(subdtset), ncol=2), subdtset$y)
  return(1 - sum((subdtset$y - reg$fitted.values)^2) / sum((subdtset$y - mean(subdtset$y, na.rm=TRUE))^2))
}, align="right", fill=NA)]
The code above accomplishes this, but my dataset has millions of rows and I have multiple columns where I want to make these calculations so it is taking a very long time. I am hoping there is a way to speed things up:
Is there a better way to capture the last 10 items in rollapply rather than passing the row numbers as the variable a and then doing subdtset <- dtset[a]? I tried using .SD and .SDcols but was unable to get that to work. I can only figure out how to get rollapply to accept one column or vector as the input, not two columns/vectors.
Is there a way to return 2 values from one rollapply statement? I think I could get significant time savings if I only had to do the regression once, and then from that take the coefficient and calculate RSquare. It's pretty inefficient to do the same calculations twice.
Thanks for the help!
Use by.column = FALSE to pass both columns to the function. In the function, calculate the slope and R squared directly to avoid the overhead of lm.fit. Note that rollapply can return a vector, and that rollapplyr with an r on the end is right-aligned. This also works if dtset consists of a single x column followed by multiple y columns, as in the example below with the built-in anscombe data frame.
library(data.table)
library(zoo)
# slope and R squared of y on x, computed directly from cov/var/cor
stats <- function(X, x = X[, 1], y = X[, -1]) {
  c(slope = cov(x, y) / var(x), rsq = cor(x, y)^2)
}
rollapplyr(dtset, 10, stats, by.column = FALSE, fill = NA)
a <- anscombe[c("x3", "y1", "y2", "y3")]
rollapplyr(a, 3, stats, by.column = FALSE, fill = NA)
Check
We check the formulas using the built-in BOD data frame.
fm <- lm(demand ~ Time, BOD)
c(coef(fm)[[2]], summary(fm)$r.squared)
## [1] 1.7214286 0.6449202
stats(BOD)
## slope rsq
## 1.7214286 0.6449202
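For the data.table in the question, one possible way to attach both statistics in a single pass (a sketch, assuming the dtset from the question and the stats function above):
res <- rollapplyr(as.matrix(dtset[, .(x, y)]), 10, stats, by.column = FALSE, fill = NA)
dtset[, c("coefficient", "rsquare") := .(res[, 1], res[, 2])]  # column 1 = slope, column 2 = rsq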

Confusion about calculating sample correlation in R

I have been tasked with manually calculating the sample correlation between two datasets (D$Nload and D$Pload), and then compare the result with R's in built cor() function.
I calculate the sample correlation with
cov(D$Nload,D$Pload, use="complete.obs")/(sd(D$Nload)*sd(D$Pload, na.rm=TRUE))
Which gives me the result 0.5693599
Then I try using R's cor() function
cor(D[, c("Nload","Pload")], use="pairwise.complete.obs")
which gives me the result:
Nload Pload
Nload 1.0000000 0.6244952
Pload 0.6244952 1.0000000
Which is a different result. Can anyone see where I've gone wrong?
This happens because when you call sd() on a single vector, it cannot check if the data is pairwise complete. Example:
x <- rnorm(100)
y <- rexp(100)
y[1] <- NA
df <- data.frame(x = x, y = y)
So here we have
df[seq(2), ]
x y
1 1.0879645 NA
2 -0.3919369 0.2191193
We see that while the second row is pairwise complete (none of the columns used in the computation is NA), the first row is not. However, if you calculate sd() on just a single column, it has no information about the pairs. So in your case, sd(df$x) will use all the available data, although it should skip the first row.
cov(df$x, df$y, use = "complete.obs") / (sd(df$x)*sd(df$y, na.rm=TRUE))
[1] 0.09301583
cor(df$x, df$y, use = "pairwise.complete.obs")
[1] 0.09313766
But if you remove the incomplete row from your computation, the results are equal
df <- df[complete.cases(df), ]
cov(df$x, df$y, use = "complete.obs") / (sd(df$x)*sd(df$y, na.rm=TRUE))
[1] 0.09313766
cor(df$x, df$y, use = "pairwise.complete.obs")
[1] 0.09313766
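Equivalently, you can keep all the rows and do the complete-case subsetting yourself when computing the standard deviations, which reproduces the pairwise result (a small sketch with the df from above):
ok <- complete.cases(df$x, df$y)  # rows where both x and y are observed
cov(df$x[ok], df$y[ok]) / (sd(df$x[ok]) * sd(df$y[ok]))
[1] 0.09313766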

R: Testing each level of a factor without creating new variables

Suppose I have a data frame with a binary grouping variable and a factor. An example of such a grouping variable could specify assignment to the treatment and control conditions of an experiment. In the below, b is the grouping variable while a is an arbitrary factor variable:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
I want to complete two-sample t-tests to assess the below:
For each level of a, whether there is a difference in the mean propensity to adopt that level between the groups specified in b.
I have used the dummies package to create separate dummies for each level of the factor and then manually performed t-tests on the resulting variables:
library(dummies)
new <- dummy.data.frame(df, names = "a")
t.test(new$aa, new$b)
t.test(new$ab, new$b)
I am looking for help with the following:
Is there a way to perform this without creating a large number of dummy variables via dummy.data.frame()?
If there is not a quicker way to do it without creating a large number of dummies, is there a quicker way to complete the t-test across multiple columns?
Note
This is similar to but different from "R - How to perform the same operation on multiple variables", and nearly the same as "Apply t-test on many columns in a dataframe split by factor", but the solution to that question no longer works.
Here is a base R solution implementing a chi-squared test for equality of proportions, which I believe is more likely to answer whatever question you're asking of your data (see my comment above):
set.seed(1)
## generate similar but larger/more complex toy dataset
a <- sample(letters[1:4], 100, replace = T)
b <- sample(0:1, 100, replace = T)
head((df <- data.frame(a,b)))
a b
1 b 1
2 b 0
3 c 0
4 d 1
5 a 1
6 d 0
## create a set of contingency tables for proportions
## of each level of df$a to the others
cTbls <- lapply(unique(a), function(x) table(df$a==x, df$b))
## apply chi-squared test to each contingency table
results <- lapply(cTbls, prop.test, correct = FALSE)
## preserve names
names(results) <- unique(a)
## only one result displayed for sake of space:
results$b
2-sample test for equality of proportions without continuity
correction
data: X[[i]]
X-squared = 0.18382, df = 1, p-value = 0.6681
alternative hypothesis: two.sided
95 percent confidence interval:
-0.2557295 0.1638177
sample estimates:
prop 1 prop 2
0.4852941 0.5312500
Be aware, however, that you might not want to interpret your p-values without correcting for multiple comparisons. A quick simulation demonstrates that the chance of incorrectly rejecting the null hypothesis with at least one of your tests can be dramatically higher than 5%(!):
set.seed(11)
sum(
replicate(1e4, {
a <- sample(letters[1:4], 100, replace = T)
b <- sample(0:1, 100, replace = T)
df <- data.frame(a,b)
cTbls <- lapply(unique(a), function(x) table(df$a==x, df$b))
results <- lapply(cTbls, prop.test, correct = FALSE)
any(sapply(results, function(x) x$p.value < .05))
})
) / 1e4
[1] 0.1642
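If you do want to report the per-level tests, one common option is to adjust the p-values for multiple comparisons, e.g. with p.adjust (a sketch using the results list created above):
pvals <- sapply(results, function(x) x$p.value)
p.adjust(pvals, method = "holm")  # or "bonferroni", "BH", ...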
I don't exactly understand what this is doing from a statistical standpoint, but this code generates a list where each element is the output from the t.test() you ran above:
a <- c("a","a","a","b","b")
b <- c(0,0,1,0,1)
df <- data.frame(a,b)
library(dplyr)
library(tidyr)
dfNew <- df %>% group_by(a) %>% summarise(count = n()) %>% spread(a, count)
lapply(1:ncol(dfNew), function(x)
  t.test(c(rep(1, dfNew[[x]][1]), rep(0, length(b) - dfNew[[x]][1])), b))
This will save you the typing of t.test(foo, bar) continuously, and also eliminates the need for dummy variables.
Edit: I don't think the above method preserves the order of the columns, only the frequency of values measured as 0 or 1. If the order is important (again, I don't know the goal of this procedure), then you can use the dummy method and lapply through the data.frame you named new.
library(dummies)
new <- dummy.data.frame(df, names = "a")
lapply(1:(ncol(new)-1), function(x)
t.test(new[,x], new[,ncol(new)]))
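As a side note, the dummy columns are just 0/1 indicators for the levels of a, so you can also build them on the fly without the dummies package (a sketch, using the small df from the question):
levs <- unique(df$a)
tests <- lapply(levs, function(lev) t.test(as.numeric(df$a == lev), df$b))  # one test per level of a
names(tests) <- levs
tests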

Permutations of correlation coefficients

My question is on the permutation of correlation coefficients.
A<-data.frame(A1=c(1,2,3,4,5),B1=c(6,7,8,9,10),C1=c(11,12,13,14,15 ))
B<-data.frame(A2=c(6,7,7,10,11),B2=c(2,1,3,8,11),C2=c(1,5,16,7,8))
cor(A,B)
# A2 B2 C2
# A1 0.9481224 0.9190183 0.459588
# B1 0.9481224 0.9190183 0.459588
# C1 0.9481224 0.9190183 0.459588
I obtained this correlation and then wanted to perform permutation tests to check if the correlation still holds.
I did the permutation as follows:
A<-as.vector(t(A))
B<-as.vector(t(B))
corperm <- function(A,B,1000) {
# n is the number of permutations
# x and y are the vectors to correlate
obs = abs(cor(A,B))
tmp = sapply(1:n,function(z) {abs(cor(sample(A,replace=TRUE),B))})
return(1-sum(obs>tmp)/n)
}
The result was
[1] 0.645
and using "cor.test"
cor.test(A,B)
Pearson's product-moment correlation
data: A and B
t = 0.4753, df = 13, p-value = 0.6425
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.4089539 0.6026075
sample estimates:
cor
0.1306868
How could I draw a plot or a histogram to show the actual correlation and the permuted correlation values from the permuted data?
First of all, you can't have run it exactly this way, as ...
> corperm = function(A,B,1000) {
Error: unexpected numeric constant in "corperm = function(A,B,1000"
The third argument has no name but it should have one! Perhaps you meant
> corperm <- function(A, B, n=1000) {
# etc
Then you need to think about what you want to achieve. Initially you have two data sets with 3 variables each, and then you collapse them into two vectors and compute a correlation between the permuted vectors. Why does that make sense? The structure of the permuted data set should be the same as that of the original data set.
obs = abs(cor(A,B))
tmp = sapply(1:n,function(z) {abs(cor(sample(A,replace=TRUE),B))})
return(1-sum(obs>tmp)/n)
Why do you use replace=TRUE here? That makes sense if you would like bootstrap CIs, but (a) it'd be better to use a dedicated function then, e.g. boot from package boot, and (b) you'd need to do the same with B, i.e. sample(B, replace=TRUE).
For a permutation test you sample without replacement, and it makes no difference whether you do it for both A and B or only for A.
And how to get the histogram? Well, hist(tmp) would draw you a histogram of the permuted values, and obs is the absolute value of the observed correlation.
HTHAB
(edit)
corperm <- function(x, y, N=1000, plot=FALSE){
  reps <- replicate(N, cor(sample(x), y))  # correlations after permuting x
  obs <- cor(x, y)                         # observed correlation
  p <- mean(reps > obs)                    # shortcut for sum(reps > obs)/N
  if(plot){
    hist(reps)
    abline(v=obs, col="red")
  }
  p
}
Now you can use this on a single pair of variables:
corperm(A[,1], B[,1])
To apply it to all pairs, use for or mapply. for is easier to understand so I wouldn't insist in using mapply to get all possible pairs.
res <- matrix(NA, nrow=NCOL(A), ncol=NCOL(B))
for(iii in 1:3) for(jjj in 1:3) res[iii,jjj] <- corperm(A[,iii], B[,jjj], plot=FALSE)
rownames(res)<-names(A)
colnames(res) <- names(B)
print(res)
To make all histograms, use plot=TRUE above.
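If you do prefer mapply, one possible equivalent of the double for loop (a sketch, assuming A, B and corperm as above):
idx <- expand.grid(i = seq_len(ncol(A)), j = seq_len(ncol(B)))  # all column pairs
res2 <- matrix(mapply(function(i, j) corperm(A[, i], B[, j]), idx$i, idx$j),
               nrow = ncol(A), dimnames = list(names(A), names(B)))
res2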
I think there is not much point in doing a permutation test for the correlation of two variables, because the cor.test() function already offers a p-value, which serves much the same purpose as the permutation test.
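For a quick side-by-side check (a sketch, using the corperm() defined above on the first pair of columns):
corperm(A[, 1], B[, 1])            # permutation p-value
cor.test(A[, 1], B[, 1])$p.value   # parametric p-value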
