Hmisc wtd.var confusion: calculating weighted variance in R

I'm trying to calculate a weighted variance for a group of studies in a meta-analysis, for which I have individual means and variances:
variance  # available variance values for studies
 [1] 0.705600       NA 2.102500 0.672400 0.980100 0.494209       NA 5.317636 4.431025       NA       NA
[12] 0.184900
number2  # patient numbers for studies with variance
 [1]  16  NA  52  15  42  22  NA 114  40  NA  NA  48
Do I need to use a weighted variance function like wtd.var from Hmisc? Or can I simply weight the variance values according to the sample size in a similar way to weighted means?
When I use the wtd.var function on the above data I get a weighted variance of 2.35, which is higher than the variance in any of the studies. Intuitively I expected the weighted variance to be somewhere between the min and max values of the individual study variances.
Thanks so much in advance for the help, I'm new to R and statistics and really struggling!

The mean of the variances gives a "typical" or "expected" variance in light of your data. If you want the mean variance, weighted by the number of patients, then compute:
# I renamed your variables a bit:
variances <- c(0.705600, NA, 2.102500, 0.672400, 0.980100, 0.494209,  NA, 5.317636, 4.431025, NA, NA, 0.184900)
n.pts <- c(16,  NA,  52,  15,  42,  22,  NA, 114,  40,  NA,  NA,  48)
wtd.mean(x = variances, weights = n.pts)
[1] 2.793894
As you anticipated, the (weighted) mean variance is toward the center of the source value range.
The variance of the variances, in contrast, describes the spread of your variances, not their central tendency (second rather than first moment of the distribution).  There is wide diversity among the variances, which the calculation captures.
wtd.var(x = variances, weights = n.pts)
[1] 4.530746
The distinction might be more obvious if we recenter the data, so the mean and variance of the distribution differ more.
variances <- variances + 20
wtd.mean(variances, weights = n.pts)
[1] 22.79389
wtd.var(variances, weights = n.pts)
[1] 4.530746

Related

Comparing means (int) by zipcode (factor) in R

I have a list of zipcodes and the number of covid deaths per zipcode in a data frame (not real numbers, just examples):
City            Total
Richmond          552
Las Vegas         994
San Francisco     388
I want to see if there is any relationship between zipcode and the total number of deaths.
I fit a linear model using the lm() function:
mod_zip <- lm(Total ~ City, data=zipcode)
But when I call summary(mod_zip) I get NA for everything except the estimate column.
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
CityRichmond      2851         NA      NA       NA
CityLasVegas     -2604         NA      NA       NA
CitySanFran       -966         NA      NA       NA
What am I doing wrong?
lm will turn the factor into dummy (one-hot) columns, so you have a parameter for each city except one, plus a global intercept.
Then (assuming, without seeing your data, that you have one row per city) you are estimating n data points with n parameters. The fit succeeds, but there are no degrees of freedom left with which to estimate a standard error.
Simplified Example to reproduce:
df <- data.frame(x = LETTERS, y = rnorm(26), stringsAsFactors = TRUE)
fit <- lm(y~x, data = df)
summary(fit)
You will see an intercept and parameters B through Z (26 parameters for 26 observations); the residual degrees of freedom are thus 0, so the standard errors and related statistics cannot be calculated.
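For contrast, here is a small sketch (an addition, not from the original answer) in which each group has several observations, so the residual degrees of freedom are positive and standard errors become available:

```r
set.seed(1)
# 5 groups x 10 observations each: 50 points, 5 parameters -> 45 residual df
df2 <- data.frame(x = rep(LETTERS[1:5], each = 10),
                  y = rnorm(50),
                  stringsAsFactors = TRUE)
fit2 <- lm(y ~ x, data = df2)

df.residual(fit2)                            # 45, so inference is possible
summary(fit2)$coefficients[, "Std. Error"]   # finite standard errors now
```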
It sounds like you are looking to test whether City is a relevant factor for predicting deaths. In other words, would you expect to see the observed range of values if each death had an equal chance of occurring in any City?
My intuition on this would be that there should certainly be a difference, based on many City-varying differences in demographics, rules, norms, vaccination rates, and the nature of an infectious disease that spreads more if more people are infected to begin with.
If you want to confirm this intuition, you could use simulation. Let's say all Cities had the same underlying risk rate of 800, and all variation was totally due to chance.
set.seed(2021)
Same_risk = 800
Same_risk_deaths = rpois(100, Same_risk)
mean(Same_risk_deaths)
sd(Same_risk_deaths)
Here, the observed mean is indeed close to 800, with a standard deviation of around 3% of the average value.
If we instead had a situation where some cities, for whatever combination of reasons, had different risk factors (say, 600 or 1000), then we could see the same average around 800, but with a much higher standard deviation around 25% of the average value.
Diff_risk = rep(c(600, 1000), 50)
Diff_risk_deaths = rpois(100, Diff_risk)
mean(Diff_risk_deaths)
sd(Diff_risk_deaths)
I imagine your data does not look like the first distribution and is instead much more varied.
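The two spread figures quoted above (roughly 3% and 25% of the mean) can also be checked analytically with the law of total variance; this sketch is an addition, not part of the original answer:

```r
# law of total variance for the Poisson mixture:
# Var(X) = E[Var(X | lambda)] + Var(E[X | lambda])
lambda <- c(600, 1000)                  # the two underlying risk rates
ev <- mean(lambda)                      # E[Var(X | lambda)] = 800 (Poisson: var = mean)
vm <- mean((lambda - mean(lambda))^2)   # Var(E[X | lambda]) = 40000

sqrt(ev) / mean(lambda)                 # same-risk case: sd/mean ~ 0.035 (about 3%)
sqrt(ev + vm) / mean(lambda)            # mixed-risk case: sd/mean ~ 0.252 (about 25%)
```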

Sum of N independent standard normal variables

I wanted to simulate sum of N independent standard normal variables.
sums <- numeric(5000)
for (i in 1:5000) {
  sums[i] <- sum(rnorm(5000, 0, 1))
}
I tried to draw N=5000 standard normal and sum them. Repeat for 5000 simulation paths.
I would expect the expectation of sums be 0, and variance of sums be 5000.
> mean(sums)
[1] 0.4260789
> var(sums)
[1] 5032.494
The simulated expectation is too big. When I tried it again, I got 1.309206 for the mean.
@ilir is correct: the value you get is essentially zero.
If you plot the sums, you see values between -200 and 200; 0.42 is, for all intents and purposes, 0.
You can test this with t.test.
> t.test(sums, mu = 0)
One Sample t-test
data: sums
t = -1.1869, df = 4999, p-value = 0.2353
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-3.167856 0.778563
sample estimates:
mean of x
-1.194646
There is no evidence that your mean value differs from zero (given that the null hypothesis is true).
It is perfectly normal that the mean does not fall exactly on 0, because it is an empirical mean computed from "only" 5000 realizations of the random variable.
However, the distribution of the realizations contained in the sums vector should "look" Gaussian.
For example, plotting the histogram and the Q-Q plot of 10000 realizations of the sum of 5000 Gaussian variables (created with sums <- replicate(1e4, sum(rnorm(5000, 0, 1)))) shows a normal-looking distribution:
hist(sums)
qqnorm(sums)
A sum of independent normals is again normal, with mean the sum of the means and variance the sum of the variances. So sum(rnorm(5000,0,1)) is equivalent to rnorm(1,0,sqrt(5000)). A sample average of normals is again a normal variable. In your case you take a sample average of 5000 independent normal variables with zero mean and variance 5000. This is a normal variable with zero mean and unit variance, i.e. standard normal.
So mean(sums) is distributed like rnorm(1), and any value in the interval (-1.96, 1.96) will come up 95% of the time.
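A quick numerical check (an addition, not from the original answers): each sum is N(0, 5000), so dividing by sqrt(5000) should yield approximately standard normal draws.

```r
set.seed(42)
sums <- replicate(2000, sum(rnorm(5000, 0, 1)))

z <- sums / sqrt(5000)  # each sum is N(0, 5000), so z should be ~ N(0, 1)
mean(z)                 # close to 0
sd(z)                   # close to 1
```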

Sample classification by probability

I have fitted mixture distributions to multi-modal biological measurement data in order to group individuals accordingly (picture a multi-modal histogram of length measurements; assuming each mode represents a different age cohort I can infer numbers at age from the easily measured length data).
The mixture distribution provides posterior probabilities for each individual's membership to each mode, and so once binned by length class one line of data might look like:
l.class freq age1 age2 age3 age5
      9   41  0.2 0.25  0.3 0.25
Where l.class is the length bin, freq is the number of individuals, and age1, age2, age3 and age5 are the probabilities of association with a given mixture mode / age group. As these are probabilities as opposed to proportions I wanted to iterate over each entry a number of times in order to get an estimate of numbers at age for each length bin.
I have tried using sample() to achieve this in R, but cannot get my head around the classification to one of a number of potential groups according to probability.
x <- sample(names(data1)[3:ncol(data1)], data1$freq, replace=T, prob=c(data1[i,3:ncol(data1)]))
Here is the approach I ended up using. I wanted to run the sampling in a loop in order to sample by probabilities a number of times (i.e. 1000), so I did this and then took the mean number of samples for each age class as my estimate.
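The final code was not shown, so here is a sketch of how the described loop could look; the data1 layout below is assumed from the example row in the question:

```r
set.seed(1)
# one length bin, laid out as in the question's example row (assumed structure)
data1 <- data.frame(l.class = 9, freq = 41,
                    age1 = 0.2, age2 = 0.25, age3 = 0.3, age5 = 0.25)
ages <- names(data1)[3:ncol(data1)]

# classify each of the freq individuals by probability, repeated 1000 times
counts <- replicate(1000, {
  draws <- sample(ages, size = data1$freq[1], replace = TRUE,
                  prob = unlist(data1[1, ages]))
  table(factor(draws, levels = ages))
})

rowMeans(counts)  # mean numbers at age; approaches freq * prob (8.2, 10.25, 12.3, 10.25)
```

For several length bins, the same sampling step can be run once per row and the per-age means collected into a matrix.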

Approximate the distribution of a sum of binomial random variables in R

My goal is to approximate the distribution of a sum of binomial random variables.
I am using the paper "The Distribution of a Sum of Binomial Random Variables" by Ken Butler and Michael Stephens.
I want to write an R script to find Pearson approximation to the sum of binomials.
There is an R package, PearsonDS, that allows doing this in a simple way.
So I take the first example from the paper and try to find the density of the Pearson distribution for this case.
Could you please explain what's wrong in the code below?
library(PearsonDS)
# define parameters for five binomial random variables
n<-rep(5,5)
p<-seq(0.02,0.10,0.02)
# find the first four cumulants
k.1<-sum(n*p)
k.2<-sum(n*p*(1-p))
k.3<-sum(n*p*(1-p)*(1-2*p))
k.4<-sum(n*p*(1-p)*(1-6*p*(1-p)))
# find the skewness and kurtosis parameters
beta.1<-k.3^2/k.2^3
beta.2<-k.4/k.2^2
# define the moments and calculate
moments <- c(mean=k.1,variance=k.2,skewness=sqrt(beta.1),kurtosis=beta.2)
dpearson(1:7,moments=moments)
I get the error message "There are no probability distributions with these moments".
What you try to insert as kurtosis in your moments is actually the excess kurtosis, which is just kurtosis - 3. From the help page of dpearson():
moments:
optional vector/list of mean, variance, skewness, kurtosis (not excess kurtosis).
So adding 3 to beta.2 will provide you with the real kurtosis:
beta.1 <- (k.3^2)/(k.2^3)
beta.2 <- k.4/(k.2^2)
kurt <- beta.2 + 3
moments <- c(mean = k.1, variance = k.2, skewness = beta.1, kurtosis = kurt)
dpearson(1:7, moments=moments)
# [1] 0.3438773545 0.2788412385 0.1295129534 0.0411140817 0.0099279576
# [6] 0.0019551512 0.0003294087
To get a result like the one in the paper, we should investigate the cumulative distribution function and add 0.5 to correct for the bias caused by approximating a discrete distribution by a continuous one:
ppearson(1:7+0.5, moments = moments)
# [1] 0.5348017 0.8104394 0.9430092 0.9865434 0.9973715 0.9995578 0.9999339
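As a cross-check (an addition, not part of the original answer), the exact distribution of the sum can be obtained by convolving the five binomial pmfs, and its cumulative probabilities compared against the ppearson approximation:

```r
n <- rep(5, 5)
p <- seq(0.02, 0.10, 0.02)

# convolve the five binomial pmfs to get the exact pmf of the sum on 0:25
pmf <- 1
for (i in seq_along(n)) {
  pmf <- convolve(pmf, rev(dbinom(0:n[i], n[i], p[i])), type = "open")
}

sum(pmf)            # 1 (sanity check)
sum((0:25) * pmf)   # 1.5, matches the first cumulant k.1
cumsum(pmf)[2:8]    # exact P(S <= 1), ..., P(S <= 7), for comparison with ppearson
```

The idiom convolve(x, rev(y), type = "open") is polynomial multiplication, which is exactly how independent discrete pmfs combine under addition.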
A little background information:
The function threw an error because the relationship between kurtosis and skewness was invalid: kurtosis is bounded below by the skewness via kurtosis >= (skewness)^2 + 1. The proof isn't pretty and is certainly beyond the scope of the question, but you can check out the references below for different versions of this inequality.
Wilkins, J. Ernest. A Note on Skewness and Kurtosis. Ann. Math. Statist. 15 (1944), no. 3, 333--335. http://projecteuclid.org/euclid.aoms/1177731243.
K. Pearson. Mathematical contributions to the theory of evolution, XIX; second supplement to a memoir on skew variation. Philos. Trans. Roy. Soc. London Ser. A, 216 (1916), p. 432 http://rsta.royalsocietypublishing.org/content/216/538-548/429
Pearson, K. (1929). "Editorial note to 'Inequalities for moments of frequency functions and for various statistical constants'". Biometrika. 21 (1–4): 361–375.

Spearman correlation and ties

I'm computing Spearman's rho on small sets of paired rankings.
Spearman is well known for not handling ties properly. For example, taking 2 sets of 8 rankings, even if 6 are ties in one of the two sets, the correlation is still very high:
> cor.test(c(1,2,3,4,5,6,7,8), c(0,0,0,0,0,0,7,8), method="spearman")
Spearman's rank correlation rho
S = 19.8439, p-value = 0.0274
sample estimates:
rho
0.7637626
Warning message:
Cannot compute exact p-values with ties
The p-value <.05 seems like a pretty high statistical significance for this data.
Is there a ties-corrected version of Spearman in R?
What is the best formula to date to compute it with a lot of ties?
Well, Kendall's tau rank correlation is also a non-parametric test for statistical dependence between two ordinal (or rank-transformed) variables, like Spearman's; unlike Spearman's, however, it can handle ties.
More specifically, there are three Kendall tau statistics--tau-a, tau-b, and tau-c. tau-b is specifically adapted to handle ties.
The tau-b statistic handles ties (i.e., both members of the pair have the same ordinal value) by a divisor term, which represents the geometric mean between the number of pairs not tied on x and the number not tied on y.
Kendall's tau is not Spearman's rho; they are not the same, but they are quite similar. You'll have to decide, based on context, whether the two are similar enough that one can be substituted for the other.
For instance, tau-b:
Kendall_tau_b = (P - Q) / ( (P + Q + Y0)*(P + Q + X0) )^0.5
P: number of concordant pairs ('concordant' means the ranks of each member of the pair of data points agree)
Q: number of discordant pairs
X0: number of pairs tied on x (but not on y)
Y0: number of pairs tied on y (but not on x)
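Plugging the question's data into the tau-b formula (a sketch, not from the original answer; X0 and Y0 here count pairs tied on x only and on y only, respectively, and no pair is tied on both):

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(0, 0, 0, 0, 0, 0, 7, 8)

pairs <- combn(length(x), 2)   # all 28 unordered pairs of observations
s <- sign(x[pairs[1, ]] - x[pairs[2, ]]) *
     sign(y[pairs[1, ]] - y[pairs[2, ]])

P  <- sum(s > 0)                              # concordant pairs: 13
Q  <- sum(s < 0)                              # discordant pairs: 0
X0 <- sum(x[pairs[1, ]] == x[pairs[2, ]])     # pairs tied on x: 0
Y0 <- sum(y[pairs[1, ]] == y[pairs[2, ]])     # pairs tied on y: 15

tau_b <- (P - Q) / sqrt((P + Q + X0) * (P + Q + Y0))
tau_b   # ~0.6814, noticeably lower than the rho of 0.7638 reported above
```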
There is in fact a variant of Spearman's rho that explicitly accounts for ties. In situations in which I needed a non-parametric rank correlation statistic, I have always chosen tau over rho. The reason is that rho sums the squared errors, whereas tau sums the absolute discrepancies. Given that both tau and rho are competent statistics and we are left to choose, a linear penalty on discrepancies (tau) has always seemed to me a more natural way to express rank correlation. That's not a recommendation; your context might be quite different and dictate otherwise.
I think exact=FALSE does the trick.
cor.test(c(1,2,3,4,5,6,7,8), c(0,0,0,0,0,0,7,8), method="spearman", exact=FALSE)
Spearman's rank correlation rho
data: c(1, 2, 3, 4, 5, 6, 7, 8) and c(0, 0, 0, 0, 0, 0, 7, 8)
S = 19.8439, p-value = 0.0274
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.7637626
cor.test with method="spearman" actually calculates Spearman coefficient corrected for ties.
I've checked it by "manually" calculating tie-corrected and tie-uncorrected Spearman coefficients from equations in Zar 1984, Biostatistical Analysis. Here's the code - just substitute your own variable names to check for yourself:
## example data (substitute your own variables; here I use the question's vectors)
lousy <- c(1, 2, 3, 4, 5, 6, 7, 8)
dors <- c(0, 0, 0, 0, 0, 0, 7, 8)
ym <- data.frame(lousy, dors) ## my data
## ranking variables
ym$l <- rank(ym$lousy)
ym$d <- rank(ym$dors)
## calculating squared differences between ranks
ym$d2d <- (ym$l-ym$d)^2
## calculating variables for equations 19.35 and 19.37 in Zar 1984
lice <- as.data.frame(table(ym$lousy))
lice$t <- lice$Freq^3-lice$Freq
dorsal <- as.data.frame(table(ym$dors))
dorsal$t <- dorsal$Freq^3-dorsal$Freq
n <- nrow(ym)
sum.d2 <- sum(ym$d2d)
Tx <- sum(lice$t)/12
Ty <- sum(dorsal$t)/12
## calculating the coefficients
rs1 <- 1 - (6*sum.d2/(n^3-n)) ## "standard" Spearman cor. coeff. (uncorrected for ties) - eq. 19.35
rs2 <- ((n^3-n)/6 - sum.d2 - Tx - Ty)/sqrt(((n^3-n)/6 - 2*Tx)*((n^3-n)/6 - 2*Ty)) ## Spearman cor.coeff. corrected for ties - eq.19.37
##comparing with cor.test function
cor.test(ym$lousy,ym$dors, method="spearman") ## cor.test gives tie-corrected coefficient!
Ties-corrected Spearman
Using method="spearman" gives you the ties-corrected Spearman. Spearman's rho, according to the definition, is simply the Pearson's sample correlation coefficient computed for ranks of sample data. So it works both in presence and in absence of ties.
You can see that after replacing your original data with their ranks (midranks for ties) and using method="pearson", you will get the same result:
> cor.test(rank(c(1,2,3,4,5,6,7,8)), rank(c(0,0,0,0,0,0,7,8)), method="pearson")
Pearson's product-moment correlation
data: rank(c(1, 2, 3, 4, 5, 6, 7, 8)) and rank(c(0, 0, 0, 0, 0, 0, 7, 8))
t = 2.8983, df = 6, p-value = 0.0274
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1279559 0.9546436
sample estimates:
cor
0.7637626
Note that there exists a simplified no-ties version of the Spearman formula, which cor.test() in fact uses in the absence of ties, but it is equivalent to the definition above.
P-value
When there are ties in the data, exact p-values are computed for neither the Spearman nor the Kendall measure (within the cor.test() implementation), hence the warning. As mentioned in Eduardo's post, to avoid the warning you should set exact=FALSE.
The paper "A new rank correlation coefficient with application to the consensus ranking problem" aims to solve the problem of ranking with ties. It also mentions that tau-b should not be used as a rank correlation measure for measuring agreement between weak orderings.
Emond, E. J. and Mason, D. W. (2002), A new rank correlation coefficient with application to the consensus ranking problem. J. Multi‐Crit. Decis. Anal., 11: 17-28. doi:10.1002/mcda.313
I was having a similar problem, and by reading the answers here and the help file on R I saw that, when you have ties, you have to add the parameter exact = FALSE to the cor.test() function. By adding this, it does not try to calculate an exact p-value, but instead "the test statistic is the estimate scaled to zero mean and unit variance, and is approximately normally distributed".
The result, in my case, was exactly the same, but without the warning about ties.
cor.test(x, y, method = "spearm", exact = FALSE)
The R package ConsRank contains an implementation of Emond and Mason's Tau_X. This appears to be the (mathematically) best currently known method for handling ties.
See the docs, which give the usage as
Tau_X(X, Y=NULL)
where X can be a matrix.
As pointed out by @wibeasley, Emond and Mason (2002) proposed Tau_X, a new rank correlation coefficient which appears to be superior to Kendall's tau-b. NelsonGon was concerned that the paper is from 2002, predating the question by a few years, but seems to have overlooked that Spearman's correlation dates from 1904 and Kendall's tau from 1938.
