Z-scores rounded to infinity for small p-values in R

I am working with a genome-wide association study dataset, with p-values ranging from 1E-30 to 1. I have an R data frame "data" which includes a variable "p" for the p-values.
I need to perform genomic correction of the p-values, which I am doing using the following code:
p=data$p
Zsq = qchisq(1-p, 1)
lambda = median(Zsq)/0.456
newZsq = Zsq/lambda
Newp = 1-pchisq(newZsq, 1)
In the command on the second line, where I use the qchisq function to convert p-values to z-scores, z-scores for p-values < 1E-16 are being rounded to infinity. This means the p-values for my most significant data points are rounded to 0 after the genomic correction, and I lose their ranking.
Is there any way around this?

Read help(".Machine"). Then set lower.tail=FALSE and avoid taking differences with 1:
p <- 1e-17
Zsq = qchisq(p, 1, lower.tail=FALSE)       # upper-tail quantile: stays finite for tiny p
lambda = median(Zsq)/0.456                 # 0.456 ~ median of the 1-df chi-squared distribution
newZsq = Zsq/lambda
Newp = pchisq(newZsq, 1, lower.tail=FALSE) # upper-tail probability: no "1 - p" subtraction
Newp
#[1] 0.4994993
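For context, the precision is lost before qchisq() even runs: in double precision, 1 - p is exactly 1 once p is smaller than .Machine$double.eps (about 2.2e-16), and qchisq(1, 1) is Inf. A quick check:
.Machine$double.eps                    # ~2.220446e-16
1 - 1e-17 == 1                         # TRUE: the subtraction discards the p-value
qchisq(1, 1)                           # Inf, hence the "infinite z-scores"
qchisq(1e-17, 1, lower.tail = FALSE)   # finite, so the ranking of small p-values is preserved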

Related

How to vary multiple parameters with lapply in R

In an attempt to avoid nesting for loops 6-7 times, I am trying to use lapply to find the proportion of randomly drawn values (combined in a certain way) that exceed some arbitrary threshold values. The problem is that I have several parameters that each vary in a certain number of ways, and these, in turn, affect how the values are combined. The goal is to use the results in an ANOVA to see how varying these parameters contributes to reaching those thresholds. However, I don't understand how to do this. I have a feeling that anonymous functions could be useful, but I don't understand how they work with more than one parameter.
I tried to simplify the code as much as possible. But again, there are just so many parameters that must be included.
trials = 10
data_means = c(0,1,2,3)
prior_samples = c(2, 8, 32)
data_SD = c(0.5, 1, 2)
thresholds = c(10, 30, 80)
The idea is that there are two distributions, data and prior, which I draw values from. I always draw one from data, but I draw a sample (see prior_samples) of values from the prior distribution. There are four different values that determine the mean of the data distribution (see data_means), but the values are drawn the same number of times (determined by trials) from each of these four "versions" of the data distribution. These are then put into nested lists:
set.seed(123)
data_list = list()
for (nMean in data_means){ #the data values
  for (nTrial in 1:trials){
    data_list[[paste(nMean, sep="_")]][[paste(nTrial, sep="_")]] = rnorm(1, nMean, 1)
  }
}
prior_list = list()
for (nSamples in prior_samples){ #the prior values
  for (nTrial in 1:trials){
    prior_list[[paste(nSamples, sep="_")]][[paste(nTrial, sep="_")]] = rnorm(nSamples, 0, 1)
  }
}
Then I create another list for the prior values, because I want to calculate the means and standard deviations (SD) of the samples of prior values. I include normal SD, as well as SD/2 and SD*2:
prior_SD = list("mean"=0, "standard_devations"=list("SD/2"=0, "SD"=0, "SD*2"=0))
prior_mean_SD = rep(list(prior_SD), trials)
prior_nested_list = list("2"=prior_mean_SD, "8"=prior_mean_SD, "32"=prior_mean_SD)
for (nSamples in 1:length(prior_samples)){
  for (nTrial in 1:trials){
    prior_nested_list[[nSamples]][[nTrial]][["mean"]] = mean(prior_list[[nSamples]][[nTrial]])
    prior_nested_list[[nSamples]][[nTrial]][["standard_devations"]][["SD/2"]] = sum(sd(prior_list[[nSamples]][[nTrial]])/2)
    prior_nested_list[[nSamples]][[nTrial]][["standard_devations"]][["SD"]] = sd(prior_list[[nSamples]][[nTrial]])
    prior_nested_list[[nSamples]][[nTrial]][["standard_devations"]][["SD*2"]] = sum(sd(prior_list[[nSamples]][[nTrial]])*2)
  }
}
Then I combine the values from the data list and the last list, using list.zip from rlist:
library(rlist)
dataMean0 = list.zip(dMean0=data_list[["0"]], pSample2=prior_nested_list[["2"]],
                     pSample8=prior_nested_list[["8"]], pSample32=prior_nested_list[["32"]])
dataMean1 = list.zip(dMean1=data_list[["1"]], pSample2=prior_nested_list[["2"]],
                     pSample8=prior_nested_list[["8"]], pSample32=prior_nested_list[["32"]])
dataMean2 = list.zip(dMean2=data_list[["2"]], pSample2=prior_nested_list[["2"]],
                     pSample8=prior_nested_list[["8"]], pSample32=prior_nested_list[["32"]])
dataMean3 = list.zip(dMean3=data_list[["3"]], pSample2=prior_nested_list[["2"]],
                     pSample8=prior_nested_list[["8"]], pSample32=prior_nested_list[["32"]])
all_values = list(mean_difference0=dataMean0, mean_difference1=dataMean1,
                  mean_difference2=dataMean2, mean_difference3=dataMean3)
Now comes the tricky part. I combine the data values and the prior values in all_values using this custom function for the Kullback-Leibler divergence. As you can see, there are 6 parameters that vary:
mean_diff refers to the means of the data distribution (data_means). It is named mean_diff because it refers to the difference in mean between the prior distribution (which is always 0) and the data distribution (which can be 0, 1, 2 or 3).
trial refers to trials,
pSample refers to the numbers of samples drawn from the prior distribution (prior_samples)
p_SD refers to the calculations of the SD based on the prior samples (normal SD, SD/2, SD*2)
data_SD refers to the SD of the data distribution, determined by data_SD
threshold refers to thresholds
The Kullback-Leibler divergence function:
kld = function(mean_diff, trial, pSample, p_SD, data_SD, threshold){
  prior_mean = all_values[[mean_diff]][[trial]][[pSample]][["mean"]]
  data_mean = all_values[[mean_diff]][[trial]][["mean"]]
  prior_SD = all_values[[mean_diff]][[trial]][[pSample]][["standard_devations"]][[p_SD]]
  posterior_SD = sqrt(1/(1/
    ((all_values[[mean_diff]][[trial]][[pSample]][["standard_devations"]][[p_SD]]
      *all_values[[mean_diff]][[trial]][[pSample]][["standard_devations"]][[p_SD]]))
    +1/(data_SD*data_SD)))
  length(
    which(
      (log(prior_SD/posterior_SD) +
         (((posterior_SD*posterior_SD) +
            (prior_mean -
              (((data_SD*data_SD))/
                ((data_SD*data_SD)+(prior_SD*prior_SD))*prior_mean +
                ((prior_SD*prior_SD))/
                ((data_SD*data_SD)+(prior_SD*prior_SD))*data_mean))^2)
          /(2*(prior_SD*prior_SD)))-0.5
       +
       log(posterior_SD/prior_SD) +
         ((((prior_SD*prior_SD)) +
            (prior_mean -
              (((data_SD*data_SD))/
                ((data_SD*data_SD)+(prior_SD*prior_SD))*prior_mean +
                ((prior_SD*prior_SD))/
                ((data_SD*data_SD)+(prior_SD*prior_SD))*data_mean))^2)
          /(2*(posterior_SD*posterior_SD)))-0.5
      )>=threshold))/trials
}
So the question is: how can one use lapply on the list with all the values (all_values) while working through all the different combinations of the six parameters? The data I want to end up with is the proportion of values (percentage of trials) that exceed the thresholds for every parameter combination.
I can't find the info I need, so any tips would be appreciated.
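In case it is useful, here is one generic pattern as a hedged sketch (the grid and the function f below are placeholders, not wired to the kld() and all_values objects above): enumerate every combination of the six parameters with expand.grid(), then lapply() over the rows and call the function once per row.
## Hypothetical illustration only: params and f() are stand-ins.
params <- expand.grid(mean_diff = 1:4, trial = 1:10, pSample = 1:3,
                      p_SD = c("SD/2", "SD", "SD*2"), data_SD = c(0.5, 1, 2),
                      threshold = c(10, 30, 80), stringsAsFactors = FALSE)
f <- function(mean_diff, trial, pSample, p_SD, data_SD, threshold) {
  # stand-in for kld(): returns something computed from one parameter combination
  mean_diff + data_SD + threshold
}
## one call per row; do.call() spreads the row's columns into f()'s named arguments
params$result <- unlist(lapply(seq_len(nrow(params)),
                               function(i) do.call(f, params[i, ])))
head(params)
The resulting data frame has one row per parameter combination plus the computed value, which is a convenient shape for a subsequent ANOVA.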

GAM with "gp" smoother: how to retrieve the variogram parameters?

I am using the following geoadditive model
library(gamair)
library(mgcv)
data(mack)
mack$log.net.area <- log(mack$net.area)
gm2 <- gam(egg.count ~ s(lon,lat,bs="gp",k=100,m=c(2,10,1)) +
             s(I(b.depth^.5)) +
             s(c.dist) +
             s(temp.20m) +
             offset(log.net.area),
           data = mack, family = tw, method = "REML")
Here I am using an exponential covariance function with range = 10 and power = 1 (m=c(2,10,1)). How can I retrieve from the results the variogram parameters (nugget, sill)? I couldn't find anything in the model output.
In the smoothing approach the correlation matrix is specified, so you only estimate the variance parameter, i.e., the sill. For example, you've set m = c(2, 10, 1) in s(, bs = 'gp'), giving an exponential correlation matrix with range parameter phi = 10. Note that phi is not identical to the range, except for the spherical correlation; for many correlation models the actual range is a function of phi.
The variance / sill parameter is closely related to the smoothing parameter in penalized regression, and you can obtain it by dividing the scale parameter by the smoothing parameter:
with(gm2, scale / sp["s(lon,lat)"])
#s(lon,lat)
# 26.20877
Is this right? No. There is a trap here: the smoothing parameters returned in $sp are not the real ones, and we need the following:
gm2_sill <- with(gm2, scale / sp["s(lon,lat)"] * smooth[[1]]$S.scale)
#s(lon,lat)
# 7.7772
And we copy in the range parameter you've specified:
gm2_phi <- 10
The nugget must be zero, since a smooth function is continuous. Using the lines.variomodel function from the geoR package, you can visualize the semivariogram for the latent Gaussian spatial random field modeled by s(lon,lat).
library(geoR)
lines.variomodel(cov.model = "exponential", cov.pars = c(gm2_sill, gm2_phi),
nugget = 0, max.dist = 60)
abline(h = gm2_sill, lty = 2)
However, be skeptical of this variogram. mgcv is not an easy environment in which to interpret geostatistics. The use of low-rank smoothers means that the above variance parameter applies to parameters in the new (reduced) parameter space rather than the original one. For example, there are 630 unique spatial locations in the spatial field for the mack dataset, so the correlation matrix should be 630 x 630, and the full vector of random effects should have length 630. But by setting k = 100 in s(, bs = 'gp'), the truncated eigendecomposition and the subsequent low-rank approximation reduce the random effects to length 100. The variance parameter is really for this reduced vector, not the original one. This might explain why the sill and the actual range do not agree with the data and the predicted s(lon,lat).
## unique locations
loc <- unique(mack[, c("lon", "lat")])
max(dist(loc))
#[1] 15.98
The maximum distance between two spatial locations in the dataset is 15.98, but the actual range from the variogram seems to be somewhere between 40 and 60, which is too large.
## predict `s(lon, lat)`, using the method I told you in your last question
## https://stackoverflow.com/q/51634953/4891738
sp <- predict(gm2,
              data.frame(loc, b.depth = 0, c.dist = 0, temp.20m = 0,
                         log.net.area = 0),
              type = "terms", terms = "s(lon,lat)")
c(var(sp))
#[1] 1.587126
The predicted s(lon,lat) only has variance 1.587, but the sill at 7.77 is much higher.

Constructing components from PLSR loadings in R

I want to compute the components for a set of variables using the loadings (weights) from a PLSR using the plsr function.
I thought that the components were computed by summing the values of each variable multiplied by the estimated loading (weight).
However, using the output from plsr and doing that doesn't give me the expected values:
Example:
library("pls")
data(oliveoil)
sens.pcr <- plsr(sensory ~ chemical, ncomp = 4, scale = F, data = oliveoil)
Extract loadings/weights:
df <- cbind(sens.pcr$loadings[,1],sens.pcr$loadings[,2],sens.pcr$loadings[,3],sens.pcr$loadings[,4])
One test observation:
firstrow <- oliveoil$chemical[1,]
Extract the components (scores):
scores <- sens.pcr$scores
Do the linear combination:
sum(firstrow*df[,1])
[1] -12.81924
Which is not equal to the first score scores[1,1] = 0.5100166
What is it that I am missing?
Using sens.pcr$loading.weights instead didn't make any big difference either.
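For reference, here is a hedged sketch of one likely explanation (my own addition, not part of the original thread, assuming the pls package behaves as documented): plsr() mean-centers X by default, and the scores are the centered X multiplied by the projection matrix (sens.pcr$projection), not by the loadings. Under that assumption the scores can be reproduced like this:
Xc <- sweep(oliveoil$chemical, 2, sens.pcr$Xmeans)   # center X the way plsr() did
drop(Xc[1, ] %*% sens.pcr$projection)[1]             # should match scores[1, 1]
all.equal(unclass(sens.pcr$scores), Xc %*% sens.pcr$projection,
          check.attributes = FALSE)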

Different results when performing PCA in R with princomp() and principal()

I tried to use princomp() and principal() to do PCA in R with the data set USArrests. However, I got two different results for the loadings/rotation and the scores.
First, I centered and normalised the original data frame so it is easier to compare the outputs.
library(psych)
trans_func <- function(x){
  x <- (x-mean(x))/sd(x)
  return(x)
}
A <- USArrests
USArrests <- apply(USArrests, 2, trans_func)
princompPCA <- princomp(USArrests, cor = TRUE)
principalPCA <- principal(USArrests, nfactors=4 , scores=TRUE, rotate = "none",scale=TRUE)
Then I got the results for the loadings and scores using the following commands:
princompPCA$loadings
principalPCA$loadings
Could you please help me explain why there is a difference, and how can we interpret these results?
At the very end of the help document of ?principal:
"The eigen vectors are rescaled by the sqrt of the eigen values to produce the component loadings more typical in factor analysis."
So principal returns the scaled loadings. In fact, principal produces a factor model estimated by the principal component method.
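As a quick numeric check of that sentence (my own sketch, reusing the objects defined above and the correlation matrix, since principal() works with correlations by default): rescaling the eigenvectors by the square roots of the eigenvalues reproduces principal()'s loadings.
e <- eigen(cor(A))                                   # A holds the original USArrests
scaled <- sweep(e$vectors, 2, sqrt(e$values), "*")   # eigenvectors * sqrt(eigenvalues)
cbind(scaled[, 1], principalPCA$loadings[, 1])       # equal up to a possible sign flip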
Four years later, I would like to provide a more accurate answer to this question. I use the iris data as an example.
data = iris[, 1:4]
First, do PCA by the eigen-decomposition
eigen_res = eigen(cov(data))
l = eigen_res$values
q = eigen_res$vectors
Then the eigenvector corresponding to the largest eigenvalue gives the factor loadings:
q[,1]
We can treat this as a reference or the correct answer. Now we check the results by different r functions.
First, by function 'princomp'
res1 = princomp(data)
res1$loadings[,1]
# compare with
q[,1]
No problem, this function actually just returns the same results as 'eigen'. Now move on to 'principal':
library(psych)
res2 = principal(data, nfactors=4, rotate="none")
# the loadings of the first PC is
res2$loadings[,1]
# compare it with the results by eigendecomposition
sqrt(l[1])*q[,1] # re-scale the eigen vector by sqrt of eigen value
You may find they are still different. The problem is that the 'principal' function does the eigendecomposition on the correlation matrix by default. Note: PCA is not invariant to rescaling of the variables. If you modify the code as
res2 = principal(data, nfactors=4, rotate="none", cor="cov")
# the loadings of the first PC is
res2$loadings[,1]
# compare it with the results by eigendecomposition
sqrt(l[1])*q[,1] # re-scale the eigen vector by sqrt of eigen value
Now, you will get the same results as 'eigen' and 'princomp'.
To summarize:
If you want to do PCA, you'd better apply the 'princomp' function.
PCA is a special case of the factor model, or a simplified version of the factor model. It is just equivalent to eigendecomposition.
We can apply PCA to get an approximation of a factor model. It doesn't care about the specific factors, i.e. the epsilons in a factor model. So, if you change the number of factors in your model, you will get the same estimates of the loadings. It is different from the maximum likelihood estimation.
If you are estimating a factor model, you'd better use the 'principal' function, since it provides more functionality, like rotation, calculating the scores by different methods, and so on.
Rescaling the loadings of a PCA model doesn't affect the results much, since you still project the data onto the same optimal directions, i.e. you maximize the variation in the resulting PCs.
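A quick numeric illustration of that last point, as a sketch of my own reusing the iris objects l and q defined above: the rescaled loadings are just a scalar multiple of the eigenvector, so the projection direction, and hence the ordering of the scores, is unchanged.
v  <- q[, 1]               # first eigenvector of cov(data)
lv <- sqrt(l[1]) * v       # "factor-style" rescaled loadings
unique(round(lv / v, 10))  # a single constant ratio: same direction
cor(as.matrix(data) %*% v, as.matrix(data) %*% lv)   # scores correlate perfectly (1)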
ev <- eigen(R) # R is a correlation matrix of DATA
ev$vectors %*% diag(ev$values) %*% t(ev$vectors)
pc <- princomp(scale(DATA, center = F, scale = T),cor=TRUE)
p <-principal(DATA, rotate="none")
#eigen values
ev$values^0.5
pc$sdev
p$values^0.5
#eigen vectors - loadings
ev$vectors
pc$loadings
p$weights %*% diag(p$values^0.5)
pc$loading %*% diag(pc$sdev)
p$loadings
#weights
ee <- diag(0,2)
for (j in 1:2) {
  for (i in 1:2) {
    ee[i,j] <- ev$vectors[i,j]/p$values[j]^0.5
  }
}
ee
#scores
s <- as.matrix(scale(DATA, center = T, scale = T)) %*% ev$vectors
scale(s)
p$scores
scale(pc$scores)

coin::wilcox_test versus wilcox.test in R

In trying to figure out which one is better to use I have come across two issues.
1) The W statistic given by wilcox.test is different from that of coin::wilcox_test. Here's my output:
wilcox_test:
Exact Wilcoxon Mann-Whitney Rank Sum Test
data: data$variableX by data$group (yes, no)
Z = -0.7636, p-value = 0.4489
alternative hypothesis: true mu is not equal to 0
wilcox.test:
Wilcoxon rank sum test with continuity correction
data: data$variable by data$group
W = 677.5, p-value = 0.448
alternative hypothesis: true location shift is not equal to 0
I'm aware that there are actually two values for W and that the smaller one is usually reported. When wilcox.test is used with a comma instead of "~" I can get the other value, but this comes up as W = 834.5. From what I understand, coin::statistic() can return three different statistics ("linear", "standardized", and "test"), where "linear" is the normal W and "standardized" is just the W converted to a z-score. None of these match up with the W I get from wilcox.test, though (linear = 1055.5, standardized = 0.7636288, test = -0.7636288). Any ideas what's going on?
2) I like the options in wilcox_test for "distribution" and "ties.method", but it seems that you can not apply a continuity correction like in wilcox.test. Am I right?
I encountered the same issue when trying to apply the Wendt formula to compute effect sizes using the coin package, and obtained aberrant r values because the linear statistic output by wilcox_test() is unadjusted.
A great explanation is already given here, and therefore I will simply address how to obtain adjusted U statistics with the wilcox_test() function. Let's use the following data frame:
d <- data.frame( x = c(rnorm(n = 60, mean = 10, sd = 5), rnorm(n = 30, mean = 16, sd = 5)),
                 g = c(rep("a",times = 60), rep("b",times = 30)) )
We can perform identical tests with wilcox.test() and wilcox_test():
w1 <- wilcox.test( formula = x ~ g, data = d )
w2 <- wilcox_test( formula = x ~ g, data = d )
Which will output two distinct statistics:
> w1$statistic
W
321
> w2@statistic@linearstatistic
[1] 2151
The values are indeed totally different (albeit the tests are equivalent).
To obtain a U statistic identical to that of wilcox.test(), you need to subtract from wilcox_test()'s output statistic the minimal value that the sum of the ranks of the reference sample can take, which is n1*(n1+1)/2.
Both commands take the first level in the factor of your grouping variable g as reference (which will by default be alphabetically ordered).
Then you can compute the smallest sum of the ranks possible for the reference sample:
n1 <- table(w2@statistic@x)[1]
And
w2@statistic@linearstatistic - n1*(n1+1)/2 == w1$statistic
should return TRUE
Voilà.
It seems that one is performing the Mann-Whitney U test and the other the Wilcoxon rank-sum test, which is defined in many different ways in the literature. They are pretty much equivalent; just look at the p-value. If you want the continuity correction in wilcox.test, just use the argument correct=T.
Check https://stats.stackexchange.com/questions/79843/is-the-w-statistic-outputted-by-wilcox-test-in-r-the-same-as-the-u-statistic
