Creating a function in R to run a chi-squared test

The goal is to write a chi-squared test function with arguments dat and res.type="pearson" that returns an R list containing the test statistic, the p-value, the expected counts, the residual type, and the residuals of that type, where the expected counts and the residuals are each stored in an r × c matrix.
I think I have figured out how to get the expected counts and the test statistic, but I can't figure out the res.type argument. The Pearson residual is (O_ij - E_ij) / sqrt(E_ij) and the standardized residual is (O_ij - E_ij) / sqrt(E_ij (1 - n_i./n..)(1 - n_.j/n..)).
Here is what I have so far:
chisquared <- function(dat, res.type) {
  # expected counts, hard-coded for a 3 x 3 table with margin totals in row/column 4
  expdata <- matrix(c((dat[4,1]*dat[1,4])/dat[4,4], (dat[4,2]*dat[1,4])/dat[4,4],
                      (dat[4,3]*dat[1,4])/dat[4,4], (dat[4,1]*dat[2,4])/dat[4,4],
                      (dat[4,2]*dat[2,4])/dat[4,4], (dat[4,3]*dat[2,4])/dat[4,4],
                      (dat[4,1]*dat[3,4])/dat[4,4], (dat[4,2]*dat[3,4])/dat[4,4],
                      (dat[4,3]*dat[3,4])/dat[4,4]),
                    nrow = 3, ncol = 3, byrow = TRUE)
  # squared distances (O - E)^2 / E for each cell
  sqdist <- matrix(c((dat[1,1]-expdata[1,1])^2/expdata[1,1], (dat[1,2]-expdata[1,2])^2/expdata[1,2],
                     (dat[1,3]-expdata[1,3])^2/expdata[1,3], (dat[2,1]-expdata[2,1])^2/expdata[2,1],
                     (dat[2,2]-expdata[2,2])^2/expdata[2,2], (dat[2,3]-expdata[2,3])^2/expdata[2,3],
                     (dat[3,1]-expdata[3,1])^2/expdata[3,1], (dat[3,2]-expdata[3,2])^2/expdata[3,2],
                     (dat[3,3]-expdata[3,3])^2/expdata[3,3]),
                   nrow = 3, ncol = 3, byrow = TRUE)
  # test statistic
  ts <- sum(sqdist)
  # res.type handling, the p-value, and the returned list are still missing
}
I'm sure there is an easier way to do this (other than the chisq.test function, which I am not allowed to use), so any advice on that would also be appreciated.
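For what it's worth, here is a minimal sketch of one more general approach (my own suggestion, not from the original thread), assuming dat is passed as the bare r × c matrix of observed counts with no margin row or column; rowSums(), colSums() and outer() replace the hard-coded indices, and the residual formulas are the ones quoted above:
chisquared <- function(dat, res.type = "pearson") {
  dat <- as.matrix(dat)
  n <- sum(dat)                                      # grand total n..
  expdata <- outer(rowSums(dat), colSums(dat)) / n   # e_ij = n_i. * n_.j / n..
  ts <- sum((dat - expdata)^2 / expdata)             # chi-squared test statistic
  df <- (nrow(dat) - 1) * (ncol(dat) - 1)
  p.value <- pchisq(ts, df = df, lower.tail = FALSE)
  adj <- outer(1 - rowSums(dat) / n, 1 - colSums(dat) / n)
  res <- switch(res.type,
                pearson      = (dat - expdata) / sqrt(expdata),
                standardized = (dat - expdata) / sqrt(expdata * adj),
                stop("unknown res.type"))
  list(statistic = ts, p.value = p.value, expected = expdata,
       res.type = res.type, residuals = res)
}
If your dat already carries the margin totals in its last row and column, you would drop them first, e.g. dat[-nrow(dat), -ncol(dat)].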

Related

Row-wise correlation between two matrices

I have two 21 × 2 matrices, EurodistCL.scl and EurodistM.scl. I want to calculate their row-wise Pearson correlations and I've tried these pieces of code:
RowCor<- sapply(1:21, function(i) cor(EurodistCL.scl[i,], EurodistM.scl[i,], method = "pearson"))
And:
cA <- EurodistCL.scl - rowMeans(EurodistCL.scl)
cB <- EurodistM.scl - rowMeans(EurodistM.scl)
sA <- sqrt(rowMeans(cA^2))
sB <- sqrt(rowMeans(cB^2))
rowMeans(cA * cB) / (sA * sB)
Both give the same output: a correlation vector of 21 ones.
Although the matrices are clearly highly correlated, they are not perfectly correlated, so I would expect some correlation coefficients to be around 0.99 or 0.98.
Why am I getting only ones? Is something wrong in the code or in the theory?
It is because you have only two values in each row. Even random values would give a correlation of +1 or -1. Try this:
a <- runif(2)
b <- runif(2)
cor(a, b)
So it is the theory that is incorrect. Although one can compute a correlation coefficient from two samples, it is of little use: to estimate a correlation coefficient, you need more than two corresponding samples.
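To see this concretely, a quick illustrative check (my own sketch): with only two observations per variable, the two points always lie exactly on a line, so the sample correlation is always ±1.
set.seed(1)
replicate(10, cor(runif(2), runif(2)))   # every value comes out as exactly 1 or -1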

How to update code to create a function for calculating Welch's test for polynomial trends?

I am trying to reproduce the SPSS output for the significance of a linear trend among means when equal variances are not assumed.
I have gratefully used code from http://www-personal.umich.edu/~gonzo/coursenotes/file3.pdf to create a function for calculating separate-variance contrasts, which, based on my searching, I understand to be the “equal variances not assumed” output in SPSS.
My problem/goal:
I am only assessing polynomial orthogonal trends (mostly linear). I want to adapt the function so that the contrast argument can take a pre-made contrast matrix rather than manually specified coefficients each time (room for typos!).
Code for creating the function from the notes:
sepvarcontrast <- function(dv, group, contrast) {
  means <- c(by(dv, group, mean))
  vars  <- c(by(dv, group, var))
  ns    <- c(by(dv, group, length))
  ihat <- contrast %*% means
  t.denominator <- sqrt(contrast^2 %*% (vars/ns))
  t.welch <- ihat / t.denominator
  num.contrast <- ifelse(is.null(dim(contrast)), 1, dim(contrast)[1])
  df.welch <- rep(0, num.contrast)
  if (is.null(dim(contrast))) contrast <- t(as.matrix(contrast))
  for (i in 1:num.contrast) {
    num <- (contrast[i,]^2 %*% (vars))^2
    den <- sum((contrast[i,]^2 * vars)^2 / (ns-1))
    df.welch[i] <- num/den
  }
  p.welch <- 2*(1 - pt(abs(t.welch), df.welch))
  result <- list(ihat = ihat, se.ihat = t.denominator, t.welch = t.welch,
                 df.welch = df.welch, p.welch = p.welch)
  return(result)
}
I would like to be able to use the function like this:
# Create a polynomial contrast matrix for 5 groups, then save
contr.mat5 <- contr.poly(5)
# Calculate separate variance
sepvarcontrast(dv, group, contrast = contr.mat5)
I have tried those exact commands to see if they would work but receive Error in contrast %*% means : non-conformable arguments.
All suggestions are appreciated! I am still learning how to create a reprex...
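One likely cause, offered as a suggestion rather than an answer from the original thread: contr.poly(5) returns a 5 × 4 matrix with each contrast stored as a column, whereas sepvarcontrast() indexes contrasts as rows (contrast[i, ]) and needs the number of columns of contrast to match the number of groups for contrast %*% means to conform. Transposing the contrast matrix should remove the non-conformable error (dv and group being the data from the question):
contr.mat5 <- t(contr.poly(5))   # now 4 x 5: one row per polynomial contrast (.L, .Q, .C, ^4)
sepvarcontrast(dv, group, contrast = contr.mat5)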

Extract p-value from gam.check in R

When I run gam.check(my_spline_gam), I get the following output.
Method: GCV Optimizer: magic
Smoothing parameter selection converged after 9 iterations.
The RMS GCV score gradiant at convergence was 4.785628e-06 .
The Hessian was positive definite.
The estimated model rank was 25 (maximum possible: 25)
Model rank = 25 / 25
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
s(x) 24.000 22.098 0.849 0.06
My question is whether I can extract this p-value separately into a table.
It looks like you cannot store the result in an object the normal way. You could use capture.output to store the console output in an object and then use strsplit to pull out the correct value. For the example in the help file this would be:
library(mgcv)
set.seed(0)
dat <- gamSim(1,n=200)
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)
r <- capture.output(gam.check(b))
p <- strsplit(r[12], " ")[[1]][11]
But because this is just the printed, rounded value, you wouldn't get the exact p-value this way.
Edit: user20650's answer will give you the proper output:
r <- k.check(b)
r[,'p-value']
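If you want the whole basis-dimension check as a small table rather than a single value, something along these lines should work (a sketch based on k.check returning a matrix with one row per smooth term):
kc <- k.check(b)
data.frame(term = rownames(kc), p.value = kc[, "p-value"], row.names = NULL)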
Use capture.output coupled with a little string manipulation:
gam_obj <- capture.output(gam.check(b, pch = 19, cex = .3))
gam_tbl <- gam_obj[12:length(gam_obj)]
str_spl <- function(x) {
  p_value <- strsplit(x, " ")[[1]]
  output_p <- as.numeric(p_value[length(p_value)])
}
p_values <- data.frame(sapply(gam_tbl, str_spl))

Different results when performing PCA in R with princomp() and principal()

I tried to use princomp() and principal() to do PCA in R with the USArrests data set. However, I got two different results for the loadings/rotation and the scores.
First, I centered and normalised the original data frame so it is easier to compare the outputs.
library(psych)
trans_func <- function(x){
x <- (x-mean(x))/sd(x)
return(x)
}
A <- USArrests
USArrests <- apply(USArrests, 2, trans_func)
princompPCA <- princomp(USArrests, cor = TRUE)
principalPCA <- principal(USArrests, nfactors = 4, scores = TRUE, rotate = "none", scale = TRUE)
Then I got the results for the loadings and scores using the following commands:
princompPCA$loadings
principalPCA$loadings
Could you please help me explain why there is a difference, and how we can interpret these results?
At the very end of the help page for ?principal:
"The eigen vectors are rescaled by the sqrt of the eigen values to produce the component loadings more typical in factor analysis."
So principal returns the scaled loadings. In fact, principal produces a factor model estimated by the principal component method.
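A quick way to verify this with the objects from the question (my own check; the two sets of loadings may differ by the arbitrary sign of each component): rescaling princomp's unit-length loadings by the component standard deviations reproduces principal's loadings.
rescaled <- unclass(princompPCA$loadings) %*% diag(princompPCA$sdev)
round(rescaled, 4)                         # princomp loadings scaled by sqrt(eigenvalues)
round(unclass(principalPCA$loadings), 4)   # same values, possibly with some columns sign-flipped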
Four years on, I would like to provide a more accurate answer to this question. I use the iris data as an example.
data = iris[, 1:4]
First, do PCA by eigendecomposition of the covariance matrix:
eigen_res = eigen(cov(data))
l = eigen_res$values
q = eigen_res$vectors
Then the eigenvector corresponding to the largest eigenvalue gives the loadings of the first principal component:
q[,1]
We can treat this as the reference, or correct, answer. Now we check the results from different R functions.
First, by function 'princomp'
res1 = princomp(data)
res1$loadings[,1]
# compare with
q[,1]
No problem; this function actually just returns the same loadings as 'eigen'. Now move on to 'principal':
library(psych)
res2 = principal(data, nfactors=4, rotate="none")
# the loadings of the first PC are
res2$loadings[,1]
# compare it with the results by eigendecomposition
sqrt(l[1])*q[,1] # re-scale the eigen vector by sqrt of eigen value
You may find they are still different. The problem is that the 'principal' function does the eigendecomposition on the correlation matrix by default. Note that PCA is not invariant to rescaling of the variables. If you modify the code as follows:
res2 = principal(data, nfactors=4, rotate="none", cor="cov")
# the loadings of the first PC are
res2$loadings[,1]
# compare it with the results by eigendecomposition
sqrt(l[1])*q[,1] # re-scale the eigen vector by sqrt of eigen value
Now, you will get the same results as 'eigen' and 'princomp'.
To summarize:
If you want to do PCA, you'd better use the 'princomp' function.
PCA is a special case, or simplified version, of the factor model. It is just equivalent to eigendecomposition.
We can apply PCA to get an approximation of a factor model. It doesn't care about the specific factors, i.e. the epsilons in a factor model. So, if you change the number of factors in your model, you will get the same estimates of the loadings. This is different from maximum likelihood estimation.
If you are estimating a factor model, you'd better use the 'principal' function, since it provides more functionality, such as rotation and calculating the scores by different methods.
Rescaling the loadings of a PCA model doesn't really change the results, since you still project the data onto the same optimal directions, i.e. the directions that maximize the variance of the resulting PCs.
# DATA is assumed to be a two-column numeric data matrix; R is its correlation matrix
ev <- eigen(R)
ev$vectors %*% diag(ev$values) %*% t(ev$vectors)   # reconstructs R
pc <- princomp(scale(DATA, center = F, scale = T), cor = TRUE)
p <- principal(DATA, rotate = "none")
# eigenvalues
ev$values^0.5
pc$sdev
p$values^0.5
# eigenvectors - loadings
ev$vectors
pc$loadings
p$weights %*% diag(p$values^0.5)
pc$loadings %*% diag(pc$sdev)
p$loadings
# weights
ee <- diag(0, 2)
for (j in 1:2) {
  for (i in 1:2) {
    ee[i,j] <- ev$vectors[i,j] / p$values[j]^0.5
  }
}
ee
# scores
s <- as.matrix(scale(DATA, center = T, scale = T)) %*% ev$vectors
scale(s)
p$scores
scale(pc$scores)

Summary statistics for imputed data from Zelig & Amelia

I'm using Amelia to impute the missing values.
While I'm able to use Zelig and Amelia to do some calculations...
How do I use these packages to find the pooled means and standard deviations of the newly imputed data?
library(Amelia)
library(Zelig)
n= 100
x1= rnorm(n,0,1) #random normal distribution
x2= .4*x1+rnorm(n,0,sqrt(1-.4)^2) #x2 is correlated with x1, r=.4
x1= ifelse(rbinom(n,1,.2)==1,NA,x1) #randomly creating missing values
d= data.frame(cbind(x1,x2))
m=5 #set 5 imputed data frames
d.imp=amelia(d,m=m) #imputed data
summary(d.imp) #provides summary of imputation process
I couldn't figure out how to format the code in a comment so here it is.
foo <- function(x, fcn) apply(x, 2, fcn)
lapply(d.imp$imputations, foo, fcn = mean)
lapply(d.imp$imputations, foo, fcn = sd)
d.imp$imputations gives a list of all the imputed data sets. You can work with that list however you like to get the means and SDs by column, and then pool them as you see fit. The same goes for correlations:
lapply(d.imp$imputations, cor)
Edit: After some discussion in the comments, I see that what you are looking for is how to combine results across imputations using Rubin's rules, for example for the mean of the imputed data sets generated by Amelia. I think you should clarify in the title and body of your post that you want to combine results over imputations to get appropriate standard errors with Rubin's rules after imputing with the Amelia package; this was not clear from the title or original description. "Pooling" can mean different things, particularly with respect to variances.
The mi.meld function expects a q matrix of estimates from each imputation, an se matrix of the corresponding standard error estimates, and a logical byrow argument; see ?mi.meld for an example. In your case, you want the sample means and the estimated standard errors of those sample means for each imputed data set in the q and se matrices passed to mi.meld, respectively.
q <- t(sapply(d.imp$imputations, foo, fcn = mean))
se <- t(sapply(d.imp$imputations, foo, fcn = sd)) / sqrt(100)
output <- mi.meld(q = q, se = se, byrow = TRUE)
should get you what you're looking for. For statistics other than the mean, you will need to get a standard error either analytically, if one is available, or otherwise by, say, bootstrapping.
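If memory serves, mi.meld() returns the pooled estimates and standard errors as the list components q.mi and se.mi, so pulling them into a small table might look like this (a sketch; it assumes the x1/x2 column names carry through from q):
pooled <- data.frame(variable = colnames(output$q.mi),
                     mean     = as.vector(output$q.mi),
                     se       = as.vector(output$se.mi))
pooled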
