Permutations of correlation coefficients - r

My question is on the permutation of correlation coefficients.
A<-data.frame(A1=c(1,2,3,4,5),B1=c(6,7,8,9,10),C1=c(11,12,13,14,15 ))
B<-data.frame(A2=c(6,7,7,10,11),B2=c(2,1,3,8,11),C2=c(1,5,16,7,8))
cor(A,B)
#           A2        B2       C2
# A1 0.9481224 0.9190183 0.459588
# B1 0.9481224 0.9190183 0.459588
# C1 0.9481224 0.9190183 0.459588
I obtained this correlation and then wanted to perform permutation tests to check if the correlation still holds.
I did the permutation as follows:
A<-as.vector(t(A))
B<-as.vector(t(B))
corperm <- function(A,B,1000) {
# n is the number of permutations
# x and y are the vectors to correlate
obs = abs(cor(A,B))
tmp = sapply(1:n,function(z) {abs(cor(sample(A,replace=TRUE),B))})
return(1-sum(obs>tmp)/n)
}
The result was
[1] 0.645
and using "cor.test"
cor.test(A,B)
Pearson's product-moment correlation
data: A and B
t = 0.4753, df = 13, p-value = 0.6425
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.4089539 0.6026075
sample estimates:
cor
0.1306868
How could I draw a plot or a histogram to show the actual correlation alongside the correlation values obtained from the permuted data?

First of all, you can't have done it exactly this way, because ...
> corperm = function(A,B,1000) {
Error: unexpected numeric constant in "corperm = function(A,B,1000"
The third argument has no name but it should have one! Perhaps you meant
> corperm <- function(A, B, n=1000) {
# etc
Then you need to think about what you want to achieve. Initially you have two data sets with 3 variables each, and then you collapse them into two vectors and compute a correlation between the permuted vectors. Why does that make sense? The structure of the permuted data set should be the same as that of the original data set.
obs = abs(cor(A,B))
tmp = sapply(1:n,function(z) {abs(cor(sample(A,replace=TRUE),B))})
return(1-sum(obs>tmp)/n)
Why do you use replace=TRUE here? This makes sense if you would like to have bootstrap CIs, but (a) it would then be better to use a dedicated function, e.g. boot from package boot, and (b) you'd need to do the same with B, i.e. sample(B, replace=TRUE).
For permutation test you sample without replacement and it makes no difference whether you do it for both A and B or only A.
And how to get the histogram? Well, hist(tmp) would draw you a histogram of the permuted values, and obs is absolute value of the observed correlation.
HTHAB
(edit)
corperm <- function(x, y, N=1000, plot=FALSE){
reps <- replicate(N, cor(sample(x), y))
obs <- cor(x,y)
p <- mean(reps > obs) # shortcut for sum(reps > obs)/N
if(plot){
hist(reps)
abline(v=obs, col="red")
}
p
}
Now you can use this on a single pair of variables:
corperm(A[,1], B[,1])
To apply it to all pairs, use for or mapply. A for loop is easier to understand, so I wouldn't insist on using mapply to get all possible pairs.
res <- matrix(NA, nrow=NCOL(A), ncol=NCOL(B))
for(iii in seq_len(NCOL(A))) for(jjj in seq_len(NCOL(B))) res[iii,jjj] <- corperm(A[,iii], B[,jjj], plot=FALSE)
rownames(res)<-names(A)
colnames(res) <- names(B)
print(res)
To make all histograms, use plot=TRUE above.

I don't think a permutation test adds much when correlating two variables, because the cor.test() function already reports a p-value ("p.value") that serves the same purpose as the permutation test.
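One way to see how closely the two p-values agree is to compute both for the same pair of variables. The following is a minimal sketch, assuming the first columns of A and B from the question; the seed and the 2000 permutations are arbitrary choices, and with only five observations the two p-values need not match closely.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility
x <- c(1, 2, 3, 4, 5)             # column A1 from the question
y <- c(6, 7, 7, 10, 11)           # column A2 from the question

obs  <- cor(x, y)
# permute x without replacement and recompute the correlation each time
perm <- replicate(2000, cor(sample(x), y))

# two-sided permutation p-value: share of permuted |r| at least as large as the observed |r|
p_perm <- mean(abs(perm) >= abs(obs))
p_t    <- cor.test(x, y)$p.value

print(c(permutation = p_perm, t_test = p_t))

# histogram of the permuted correlations with the observed value marked in red
hist(perm, main = "Permuted correlations", xlab = "r")
abline(v = obs, col = "red")
```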

Related

Computing Spearman's rho for increasing subsets of rows in for Loop

I am trying to fit a for Loop in R in order to run correlations for multiple subsets in a data frame and then store the results in a vector.
What I have in this loop is a data frame with 2 columns, x and y, and 30 rows of different continuous measurement values in each column. The process should be repeated 100 times. The data can be invented.
What I need, is to compute the Spearman's rho for the first five rows (between x and y) and then for increasing subsets (e.g., the sixth first rows, the sevenths first rows etc.). Then, I'd need to store the rho results in a vector that I can further use.
What I had in mind (but does not work):
sortvector <- 1:(30)
for (i in 1:100)
{
sortvector <- sample(sortvector, replace = F)
xtemp <- x[sortvector]
rho <- cor.test(xtemp,y, method="spearman")$estimate
}
The problem is that the code gives me one value of rho for the whole dataframe, but I need it for increments of subsets.
How can I get rho for subsets of increasing size in a for loop? And how can I store the coefficients in a vector that I can use afterwards?
Any help would be much appreciated, thanks.
Cheers
The easiest approach is to convert the for loop into a call to sapply, which returns a vector of rho values from your permutations:
sortvector <- 1:(30)
x <- rnorm(30)
y <- rnorm(30)
rho <- sapply(1:100, function(i) {
sortvector <- sample(sortvector, replace = F)
xtemp <- x[sortvector]
cor.test(xtemp, y, method = "spearman")$estimate
})
head(rho)
Output:
rho rho rho rho rho rho
0.014460512 -0.239599555 0.003337041 -0.126585095 0.007341491 0.264516129
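The question also asks for rho on increasing subsets (the first five rows, the first six rows, and so on). A possible sketch is shown here, with invented data; the reading that each of the 100 repetitions shuffles the row order while keeping the x-y pairs together is an assumption, not something the question states.

```r
set.seed(1)                 # arbitrary seed for the invented data
x <- rnorm(30)
y <- rnorm(30)

sizes <- 5:length(x)        # subset sizes: first 5 rows, first 6 rows, ..., all 30

# rho for each increasing subset of the original row order
rho_by_size <- sapply(sizes, function(k) {
  idx <- seq_len(k)
  cor(x[idx], y[idx], method = "spearman")
})
names(rho_by_size) <- sizes

# repeat 100 times on shuffled row orders (assumed interpretation):
# one column per repetition, one row per subset size
rho_mat <- replicate(100, {
  ord <- sample(length(x))
  sapply(sizes, function(k) {
    idx <- ord[seq_len(k)]
    cor(x[idx], y[idx], method = "spearman")
  })
})
dim(rho_mat)
```

Here rho_by_size is a plain numeric vector that can be reused later, and rho_mat[, j] holds the same kind of vector for the j-th repetition.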

how do i perform ks test on multiple columns in a matrix?

(a) Generate 1000 samples where each consists of 50 independent exponential random variables with
mean 1. Estimate the mean of each sample. Draw a histogram of the means.
(b) Perform a KS test on each sample against the null hypothesis that they are from an exponential
random variable with a mean that matches the mean of the data set. Draw a histogram of the
1000 values of D.
I did part (a) with this code:
set.seed(0)
simdata = rexp(50000, 1)
matrixdata = matrix(simdata,nrow=50,ncol=1000)
means.exp = apply(matrixdata,2,mean)
means.exp
hist(means.exp)
but I'm stuck on part (b).
You can use lapply on the column indices:
# KS test on every column
# H0: pexp(rate = 1/mean(column))
lst.ks <- lapply(1:ncol(matrixdata), function(i)
ks.test(matrixdata[, i], "pexp", 1.0/means.exp[i]))
Or directly without having to rely on means.exp:
lst.ks <- lapply(1:ncol(matrixdata), function(i)
ks.test(matrixdata[, i], "pexp", 1.0/mean(matrixdata[, i])))
Here 1.0/means.exp[i] corresponds to the rate of the exponential distribution.
PS. Using means.exp = colMeans(matrixdata) is faster than apply(matrixdata, 2, mean).
To extract the test statistic and store it in a vector simply sapply over the KS test results:
# Extract test statistic as vector
Dstat <- sapply(lst.ks, function(x) x$statistic)
# (gg)plot Dstat
library(ggplot2)
ggplot(data.frame(D = Dstat), aes(D)) + geom_histogram(bins = 30)
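The same steps can be collapsed into a single pass that needs only base R. This is a sketch under the assumptions of the exercise (50 x 1000 matrix, rate estimated as 1/mean of each column); ks_D is a name introduced here for illustration. Because the rate is estimated from each sample, the p-values reported by ks.test should be read with caution; the D values themselves are unaffected.

```r
set.seed(0)
matrixdata <- matrix(rexp(50000, 1), nrow = 50, ncol = 1000)

# one KS statistic per column, testing against an exponential
# whose rate is estimated from that same column
ks_D <- vapply(seq_len(ncol(matrixdata)), function(i) {
  col <- matrixdata[, i]
  unname(ks.test(col, "pexp", rate = 1 / mean(col))$statistic)
}, numeric(1))

# base-graphics histogram of the 1000 values of D
hist(ks_D, breaks = 30, main = "KS statistic D", xlab = "D")
```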

How to solve equations symbolically in a loop in R?

I need to conduct Gaussian Maximum Likelihood Classification for 1000 data sets of two classes of bivariate Gaussian distributions with each 100 data points.
Here is the code to create the data sets:
# mean vector for two classes
mean1<-c(70,130) ; mean2<-c(148,160)
# covariance matrix for two classes
cov1<-matrix(c(784,-546,-546,900),nrow=2,ncol=2,byrow=TRUE)
cov2<-matrix(c(484,285.1,285.1,324),nrow=2,ncol=2,byrow=TRUE)
library(MASS)
# Number of samples
nrs <- 1000
# sample size
ss <- 100
# number of dimensions
d <- length(mean1)
set.seed(1)
# generation of bivariate normal random variables based on mean vector and covariance matrix for each class
refdata_1 <- replicate(nrs,matrix(mvrnorm(ss, mu = mean1, Sigma = cov1 ),ncol = d,nrow = ss),simplify=FALSE)
refdata_2 <- replicate(nrs,matrix(mvrnorm(ss, mu = mean2, Sigma = cov2 ),ncol = d,nrow = ss),simplify=FALSE)
# calculation of mean vector for each sample of random reference data
mean_refdata_1 <- lapply(refdata_1,colMeans)
mean_refdata_2 <- lapply(refdata_2,colMeans)
# calculation of covariance matrix for each sample of random reference data
cov_refdata_1 <- lapply(refdata_1,cov)
cov_refdata_2 <- lapply(refdata_2,cov)
Now, I need to plot the decision boundary between the two classes for each of the 1000 data sets (thus 1000 decision boundaries).
This is the vector of the data points:
x = vector(SR,var('a,b'))
Here is the decision equation (if you wonder where the ln p(class) part is, both classes have same probability and thus cancel each other out):
decision1 =-0.5*log(det(cov1))-0.5*((x-mean1)*cov1.inverse()*(x-mean1))
decision2 =-0.5*log(det(cov2))-0.5*((x-mean2)*cov2.inverse()*(x-mean2))
If decision1(data point) > decision2(data point), then the data point belongs to class 1. The decision boundary is where decision1 - decision2 == 0. The data points come from RGB images, so a in the data vector x ranges over 0:255. I solve the equation for b:
solve(decision1-decision2==0,b)
In R, that looks for the original data set like this:
m_1<-c(70,130) ; m_2<-c(148,160)
covma_1<-matrix(c(784,-546,-546,900),nrow=2,ncol=2,byrow=TRUE)
covma_2<-matrix(c(484,285.1,285.1,324),nrow=2,ncol=2,byrow=TRUE)
library(rSymPy)
c11 <- Var("c11")
c12 <- Var("c12")
c13 <- Var("c13")
c14 <- Var("c14")
sympy("covma_1 = Matrix([[c11,c12], [c13,c14]])")
a <- Var("a")
b <- Var("b")
sympy("x = Matrix([a,b])")
m11 <- Var("m11")
m12 <- Var("m12")
sympy("m_1 = Matrix([m11,m12])")
sympy("covma_1=covma_1.subs(c11,784)")
sympy("covma_1=covma_1.subs(c12,-546)")
sympy("covma_1=covma_1.subs(c13,-546)")
sympy("covma_1=covma_1.subs(c14,900)")
sympy("m_1=m_1 .subs(m11,70)")
sympy("m_1=m_1 .subs(m12,130)")
first <-sympy("-0.5*log(covma_1.det())")
second <-sympy("-0.5*((x-m_1).T*covma_1.inv()*(x-m_1))")
second<-gsub("\\[","",second)
second<-gsub("\\]","",second)
c21 <- Var("c21")
c22 <- Var("c22")
c23 <- Var("c23")
c24 <- Var("c24")
sympy("covma_2 = Matrix([[c21,c22], [c23,c24]])")
m21 <- Var("m21")
m22 <- Var("m22")
sympy("m_2 = Matrix([m21,m22])")
sympy("covma_2=covma_2.subs(c21,484)")
sympy("covma_2=covma_2.subs(c22,285.1)")
sympy("covma_2=covma_2.subs(c23,285.1)")
sympy("covma_2=covma_2.subs(c24,324)")
sympy("m_2=m_2.subs(m21,148)")
sympy("m_2=m_2.subs(m22,160)")
third <-sympy("-0.5*log(covma_2.det())")
fourth <-sympy("-0.5*((x-m_2).T*covma_2.inv()*(x-m_2))")
fourth<-gsub("\\[","",fourth)
fourth<-gsub("\\]","",fourth)
class1 <- paste(c(first,second),collapse="")
class2 <- paste(c(third,fourth),collapse="")
sympy(paste(c("hm=solve(",class2,"-","(",class1,")",",b)"), collapse = ""))
As you can see, I use very nasty string operations to pass the expressions to sympy. Anyway, after solving for b in sympy, I am stuck and don't know how to get numeric values. Can somebody tell me how to solve symbolically for b and plot it in a loop for 1000 data sets? I'm also open to non-symbolic approaches.
Any help is appreciated!
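A non-symbolic route, sketched here under the assumption that plotting the boundary is the goal rather than obtaining a closed form for b: evaluate decision1 - decision2 on a grid over 0:255 and draw its zero contour. The helper name discriminant is hypothetical; mahalanobis() from base R returns the squared distance (x - mean)' cov^-1 (x - mean) used in the decision equation.

```r
m_1 <- c(70, 130); m_2 <- c(148, 160)
covma_1 <- matrix(c(784, -546, -546, 900), nrow = 2, ncol = 2, byrow = TRUE)
covma_2 <- matrix(c(484, 285.1, 285.1, 324), nrow = 2, ncol = 2, byrow = TRUE)

# Gaussian discriminant for a matrix of points (one point per row);
# equal class priors are assumed, so the ln p(class) term is dropped
discriminant <- function(pts, mu, sigma) {
  -0.5 * log(det(sigma)) - 0.5 * mahalanobis(pts, center = mu, cov = sigma)
}

a <- 0:255
b <- 0:255
grid <- as.matrix(expand.grid(a = a, b = b))   # a varies fastest

diff12 <- discriminant(grid, m_1, covma_1) - discriminant(grid, m_2, covma_2)
z <- matrix(diff12, nrow = length(a), ncol = length(b))  # z[i, j] is at a[i], b[j]

# the decision boundary is the zero level of decision1 - decision2
contour(a, b, z, levels = 0, drawlabels = FALSE, xlab = "a", ylab = "b")
```

For the 1000 simulated data sets, the same call could be repeated inside a loop with mean_refdata_1[[i]], cov_refdata_1[[i]] and their class-2 counterparts, passing add = TRUE to contour() after the first plot.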

nls() in R using entire matrix

I have data which I want to fit to the following equation using R:
Z(u,w)=z0*F(w)*[1-exp((-b*u)/F(w))]
where z0 and b are constants and F(w), w=0,...,9 is a decreasing step function that depends on w with F(0)=1 and u=1,...,50.
Z(u,w) is an observed data set in the form of a 50x10 matrix (u=50,...,1 down the rows and w=0,...,9 along the columns). For example, Z(42,3) is the element in the 9th row down and the 4th column along.
Using F(0)=1 I was able to get estimates of b and z0 using just the first column (ie w=0) with the code:
n0=nls(zuw~z0*(1-exp(-b*u)),start=list(z0=283,b=0.03),options(digits=10))
I then found F(w) for w=1,...,9 by going through each column and using the values of b and z0 I had found.
However, I was wanting to find a way to estimate all the 12 parameters at once (b, z0 and the 10 values of F(w)) as b and z0 should be fitted to all the data, not just the first column.
Does anyone know of any way of doing this? All help would be greatly appreciated!
Thanks
James
This may be a case where the formula interface of the nls(...) function works against you. As an alternative, you can use nls.lm(...) in the minpack.lm package to perform non-linear regression with a programmatically defined function. To demonstrate this, first we create an artificial dataset which follows your functional form by design, with random error added (error ~ N[0,1]).
u <- 1:50
w <- 0:9
z0 <- 100
b <- 0.02
F <- 10/(10+w^2)
# matrix containing data, in OP's format: rows are u, cols are w
m <- do.call(cbind,lapply(w,function(w)
z0*F[w+1]*(1-exp(-b*u/F[w+1]))+rnorm(length(u),0,1)))
So now we have a matrix m, which is equivalent to your dataset. This matrix is in the so-called "wide" format: the response for different values of w is in different columns. We need it in "long" format: all responses in a single column, with separate columns identifying u and w. We do this using melt(...) in the reshape2 package.
# prepend values of u
df.wide <- data.frame(u=u, m)
library(reshape2)
# reshape to long format: col1 = u, col2=w, col3=z
df <- melt(df.wide,id="u",variable.name="w", value.name="z")
df$w <- as.numeric(substr(df$w,2,4))-1
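The same long layout can also be built without reshape2, relying on the fact that as.vector() unrolls a matrix column by column. This is only a sketch; the random matrix stands in for the real 50 x 10 data, and df_long is a name introduced here for illustration.

```r
u <- 1:50
w <- 0:9
set.seed(1)                                                   # arbitrary seed for the stand-in data
m <- matrix(rnorm(length(u) * length(w)), nrow = length(u))   # placeholder for the 50 x 10 data

df_long <- data.frame(
  u = rep(u, times = length(w)),   # u cycles 1..50 within each column
  w = rep(w, each  = length(u)),   # w is constant within each column
  z = as.vector(m)                 # column-major: column 1 first, then column 2, ...
)
head(df_long)
```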
Now we have a data frame df with columns u, w, and z. In the call used here, nls.lm(...) receives four arguments: par is a vector of initial estimates of the parameters of the fit, fn is a function that calculates the residuals at each step, and two extra arguments that nls.lm(...) passes through to fn: observed is the dependent variable (z), and xx is a vector or matrix containing the independent variables (u, w).
Next we define a function, f(par, xx), where par is an 11 element vector. The first two elements contain estimates of z0 and b. The next 9 contain estimates of F(w), w=1:9. This is because you state that F(0) is known to be 1. xx is a matrix with two columns: the values for u and w respectively. f(par,xx) then calculates estimate of the response z for all values of u and w, for the given parameter estimates.
library(minpack.lm)
# model function
f <- function(pars, xx) {
z0 <- pars[1]
b <- pars[2]
F <- c(1,pars[3:11])
u <- xx[,1]
w <- xx[,2]
z <- z0*F[w+1]*(1-exp(-b*u/F[w+1]))
return(z)
}
# residual function
resids <- function(p, observed, xx) {observed - f(p,xx)}
Next we perform the regression using nls.lm(...), which uses a highly robust fitting algorithm (Levenberg-Marquardt). Consequently, we can set the par argument (containing the initial estimates of z0, b, and F) to all 1's, which is fairly distant from the values used in creating the dataset (the "actual" values). nls.lm(...) returns a list with several components (see the documentation). The par component contains the final estimates of the fit parameters.
# initial parameter estimates; all 1's
par.start <- c(z0=1, b=1, rep(1,9))
# fit using Levenberg-Marquardt algorithm
nls.out <- nls.lm(par=par.start,
fn = resids, observed = df$z, xx = df[,c("u","w")],
control=nls.lm.control(maxiter=10000, ftol=1e-6, maxfev=1e6))
par.final <- nls.out$par
results <- rbind(predicted=c(par.final[1:2],1,par.final[3:11]),actual=c(z0,b,F))
print(results,digits=5)
# z0 b
# predicted 102.71 0.019337 1 0.90456 0.70788 0.51893 0.37804 0.27789 0.21204 0.16199 0.13131 0.10657
# actual 100.00 0.020000 1 0.90909 0.71429 0.52632 0.38462 0.28571 0.21739 0.16949 0.13514 0.10989
So the regression has done an excellent job of recovering the "actual" parameter values. Finally, we plot the results using ggplot just to make sure this is all correct. I can't overemphasize how important it is to plot the final results.
df$pred <- f(par.final,df[,c("u","w")])
library(ggplot2)
ggplot(df,aes(x=u, color=factor(w)))+
geom_point(aes(y=z))+ geom_line(aes(y=pred))

How can I perform a pairwise t.test in R across multiple independent vectors?

TL;DR edition
I have vectors X1,X2,X3,...Xn. I want to test to see whether the average value for any one vector is significantly different than the average value for any other vector, for every possible combination of vectors. I am seeking a better way to do this in R than running n^2 individual t.tests.
Full Story
I have a data frame full of census data for a particular CSA. Each row contains observations for each variable (column) for a particular census tract.
What I need to do is compare means for the same variable across census tracts in different MSAs. In other words, I want to factor my data.frame according to the MSA designation variable (which is one of the columns) and then compare the differences in the means for another variable of interest pairwise across each newly-factored MSA. This is essentially doing pairwise t.tests across each ensuing vector, but I wish to do this in a more elegant way than writing t.test(MSAx, MSAy) over and over again. How can I do this?
The advantage of my method below over the one proposed by @ashkan is that mine removes duplicates (i.e. either X1 vs X2 or X2 vs X1 will appear in the results, not both).
# Generate dummy data
df <- data.frame(matrix(rnorm(100), ncol = 10))
colnames(df) <- paste0("X", 1:10)
# Create combinations of the variables
combinations <- combn(colnames(df),2, simplify = FALSE)
# Do the t.test
results <- lapply(seq_along(combinations), function (n) {
df <- df[,colnames(df) %in% unlist(combinations[n])]
result <- t.test(df[,1], df[,2])
return(result)})
# Rename list for legibility
names(results) <- paste(matrix(unlist(combinations), ncol = 2, byrow = TRUE)[,1], matrix(unlist(combinations), ncol = 2, byrow = TRUE)[,2], sep = " vs. ")
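With 45 tests, the individual p-values usually need a multiplicity correction before they are interpreted. A compact sketch of pulling them out and adjusting them follows; the Holm method and the names p_raw and p_holm are choices made here for illustration.

```r
set.seed(1)                                   # arbitrary seed for the dummy data
df <- data.frame(matrix(rnorm(100), ncol = 10))
colnames(df) <- paste0("X", 1:10)

# every unordered pair of columns, as in the answer above
pairs <- combn(colnames(df), 2, simplify = FALSE)

# one Welch t-test p-value per pair
p_raw <- vapply(pairs, function(p) t.test(df[[p[1]]], df[[p[2]]])$p.value, numeric(1))
names(p_raw) <- vapply(pairs, paste, character(1), collapse = " vs. ")

# adjust for multiple comparisons
p_holm <- p.adjust(p_raw, method = "holm")
head(sort(p_holm))
```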
Just use pairwise.t.test, here is an example:
x1 <- rnorm(50)
x2 <- rnorm(30, mean=0.2)
x3 <- rnorm(100,mean=0.1)
x4 <- rnorm(100,mean=0.4)
x <- data.frame(data=c(x1,x2,x3,x4),
key=c(
rep("x1", length(x1)),
rep("x2", length(x2)),
rep("x3", length(x3)),
rep("x4", length(x4))) )
pairwise.t.test(x$data,
x$key,
pool.sd=FALSE)
# Pairwise comparisons using t tests with non-pooled SD
#
# data: x$data and x$key
#
#    x1     x2     x3
# x2 0.7395 -      -
# x3 0.9633 0.9633 -
# x4 0.0067 0.9633 0.0121
#
# P value adjustment method: holm
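When the data already sit in a data frame with one grouping column, as in the question, the stacking step is unnecessary. A minimal sketch, where the data frame tracts and its columns msa and income are invented stand-ins for the census data:

```r
set.seed(1)                                   # arbitrary seed for the invented data
# hypothetical tract-level data: `msa` and `income` are invented column names
tracts <- data.frame(
  msa    = rep(c("MSA_A", "MSA_B", "MSA_C"), times = c(40, 25, 35)),
  income = c(rnorm(40, 50), rnorm(25, 52), rnorm(35, 50))
)

# all pairwise Welch t-tests between MSAs, Holm-adjusted by default
res <- pairwise.t.test(tracts$income, tracts$msa, pool.sd = FALSE)

# lower-triangular matrix of adjusted p-values
res$p.value
```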
If you have a data.frame and you wish to perform t-tests independently between each pair of its columns, you can use a double apply loop:
apply(MSA, 2, function(x1) {
apply(MSA, 2, function(x2) {
t.test(x1, x2)
})
})
A good visualization to accompany such a brute force approach would be a forest plot:
cis <- apply(MSA, 2, function(x) mean(x) + c(-1, 1) * 1.96 * sd(x) / sqrt(length(x))) # 95% CI of the mean uses the standard error
plot.new()
plot.window(xlim=c(1, ncol(cis)), ylim=range(cis))
segments(1:ncol(cis), cis[1, ], 1:ncol(cis), cis[2, ])
axis(1, at=1:ncol(cis), labels=colnames(MSA))
axis(2)
box()
abline(h=mean(as.matrix(MSA)), lty='dashed') # mean() of a data.frame returns NA, so convert first
title('Forest plot of 95% confidence intervals of MSA')
In addition to the response from quarzgar, there is another way to perform pairwise t-tests across multiple factors in R. The trick for two (or more) factors is to create a single factor from the combination of their levels.
Example with a 2x2 classical design:
df <- data.frame(Id=c(rep(1:100,2),rep(101:200,2)),
dv=c(rnorm(100,10,5),rnorm(100,20,7),rnorm(100,11,5),rnorm(100,12,6)),
Group=c(rep("Experimental",200),rep("Control",200)),
Condition=rep(c(rep("Pre",100),rep("Post",100)),2))
#ANOVA
summary(aov(dv~Group*Condition+Error(Id/Condition),data = df))
#post-hoc across all factors
df$posthoclevels <- paste(df$Group,df$Condition) #factor combination
pairwise.t.test(df$dv,df$posthoclevels)
# Pairwise comparisons using t tests with pooled SD
#
# data: df$dv and df$posthoclevels
#
#                   Control Post Control Pre Experimental Post
# Control Pre       0.60         -           -
# Experimental Post <2e-16       <2e-16      -
# Experimental Pre  0.26         0.47        <2e-16
#
# P value adjustment method: holm

Resources