Calculating a correlation coefficient that includes missing values - r

I'm looking to calculate some form of correlation coefficient in R (or any common stats package actually) in which the value of the correlation is influenced by missing values. I am not sure if this is possible and am looking for a method. I do not want to impute data, but actually want the correlation to be reduced based on the number of incomplete cases included in some systematic fashion. The data are a series of time points generated by different individuals and the correlation coefficient is being used to compute reliability. In many cases, one individual's data will include several more time points than the other individual...
Again, not sure if there is any standard procedure for dealing with such a situation.

One thing to look at is fitting a logistic regression to whether or not a point is missing. If there is no relationship then that provides support for assuming that the missing values won't provide any information. If that is your case then you won't have to impute anything and can just perform your computation without the missing values. glm in R can be used for logistic regression.
Also on a different note, see the use="pairwise.complete.obs" argument to cor which may or may not apply to you.
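A rough sketch of that missingness check (hypothetical data: x2 has the missing time points, and we ask whether being NA in x2 relates to the other series or to time):
set.seed(1)
time <- 1:50
x1 <- rnorm(50)
x2 <- rnorm(50)
x2[sample(50, 10)] <- NA               # some missing time points in x2
miss <- as.integer(is.na(x2))          # 1 = missing, 0 = observed
fit <- glm(miss ~ x1 + time, family = binomial)
summary(fit)                           # no significant terms suggests the missingness
                                       # carries no information about the values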
EDIT: I have revised this answer based on rereading the question.

My feeling is that when a data pair has an NA in one of the time series, that pair cannot be used for calculating a correlation, as there is no information at that point. And since there is no information at that point, there is no way to know how it would have influenced the correlation. Specifying that an NA reduces the correlation seems tricky: had an observation been present at that point, it could just as easily have improved the correlation.
The default behavior of cor in R is to return NA for the correlation if any NA is present. This behavior can be tweaked using the 'use' argument; see ?cor for details.
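A minimal illustration of that default and of the 'use' argument:
x <- c(1, 2, 3, 4, NA)
y <- c(2, 1, 4, 3, 5)
cor(x, y)                                 # default ("everything"): returns NA
cor(x, y, use = "complete.obs")           # drop any observation with an NA
cor(x, y, use = "pairwise.complete.obs")  # drop NAs pair by pair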

As pointed out in the answer by Paul Hiemstra, there is no way of knowing whether the correlation would have been higher or lower without missing values. However, for some applications it may be appropriate to penalize the observed correlation for non-matching missing values. For example, if we compare two individual coders, we may want coder B to say "NA" if and only if coder A says "NA" as well, plus we want their non-NA values to correlate.
Under these assumptions, a simple way to penalize non-matching missing values is to compute correlation for complete cases and multiply by the proportion of observations that are matched in terms of their NA-status. The penalty term can then be defined as: 1 - mean((is.na(coderA) & !is.na(coderB)) | (!is.na(coderA) & is.na(coderB))). A simple illustration follows.
myfun = function(x1, x2, idx_rm) {
  temp = x2
  # remove 'idx_rm' points from x2
  temp[idx_rm] = NA
  # calculate correlations
  r_full = round(cor(x1, x2, use = 'pairwise.complete.obs'), 2)
  r_NA = round(cor(x1, temp, use = 'pairwise.complete.obs'), 2)
  # penalty: proportion of observations whose NA-status matches in x1 and temp
  penalty = 1 - mean((is.na(temp) & !is.na(x1)) |
                     (!is.na(temp) & is.na(x1)))
  r_pen = round(r_NA * penalty, 2)
  # plot the data with the removed points in red and both regression lines
  plot(x1, temp, main = paste('r_full =', r_full,
                              '; r_NA =', r_NA,
                              '; r_pen =', r_pen),
       xlim = c(-4, 4), ylim = c(-4, 4), ylab = 'x2')
  points(x1[idx_rm], x2[idx_rm], col = 'red', pch = 16)
  regr_full = as.numeric(summary(lm(x2 ~ x1))$coef[, 1])
  regr_NA = as.numeric(summary(lm(temp ~ x1))$coef[, 1])
  abline(regr_full[1], regr_full[2])
  abline(regr_NA[1], regr_NA[2], lty = 2)
}
Run a simple simulation to illustrate the possible effects of missing values and penalization:
set.seed(928)
x1 = rnorm(20)
x2 = x1 * rnorm(20, mean = 1, sd = .8)
# A case when NA's artificially inflate the correlation,
# so penalization makes sense:
myfun(x1, x2, idx_rm = c(13, 19))
# A case when NA's DEflate the correlation,
# so penalization may be misleading:
myfun(x1, x2, idx_rm = c(6, 14))
# When there are a lot of NA's, penalization is much stronger
myfun(x1, x2, idx_rm = 7:20)
# Some NA's in x1:
x1[1:5] = NA
myfun(x1, x2, idx_rm = c(6, 14))

Related

Posterior distribution missing from plots

I'm trying to use R to calculate a posterior distribution and produce a triplot graph for my prior, likelihood and posterior distributions. I have the prior distribution π1(θ) = Beta(1.5, 1.5).
Here is my R code:
n <- 25
X <- 16
a <- 1.5
b <- 1.5
grid <- seq(0,1,.01)
like <- dbinom(X,n,grid)
like
like <- like/sum(like)
like
prior <- dbeta(grid,a,b)
prior1 <- prior/sum(prior)
post <- like*prior
post <- post/sum(post)
It does give me a triplot, but I also want to get the values of my posterior distribution; it seems something is missing in my code.
To clarify, I am looking for the posterior distribution of θ for the above prior distribution
In addition, I have tried:
install.packages("LearnBayes")
library("LearnBayes")
prior = c( a= 1.5, b = 1.5 )
data = c( s = 25, f = 16 )
triplot(prior,data)
It gives me a perfect triplot, but again no values for the posterior.
It's there; it's just that the prior is so weakly informative (Beta(a = 1.5, b = 1.5) is nearly uniform) that the likelihood function differs very little from the posterior. An intuitive way to think about this is that a + b - 2 = 1, meaning the prior is effectively supported by only 1 previous observation, whereas n = 25, meaning the data are supported by 25 observations. This leads to the data dominating the posterior in terms of contributing information.
Changing the prior to be stronger will make the difference more apparent:
prior <- c(a=10, b=10)
data <- c(s=25, f=16)
triplot(prior, data)
Note, there is nothing wrong with using a weakly informative prior, if that is all the information that is available. When the observed data is large enough, it should dominate the posterior.
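If you also want actual numbers for the posterior rather than only the triplot, note that the Beta prior is conjugate to the binomial likelihood, so the posterior has a closed form; a minimal sketch with the values from the question:
# Beta(a, b) prior with X successes out of n trials gives a Beta(a + X, b + n - X) posterior
a <- 1.5; b <- 1.5; n <- 25; X <- 16
post_a <- a + X
post_b <- b + n - X
grid <- seq(0, 1, 0.01)
post_density <- dbeta(grid, post_a, post_b)   # posterior density on the grid
post_a / (post_a + post_b)                    # posterior mean of theta
qbeta(c(0.025, 0.975), post_a, post_b)        # 95% credible interval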

Lasso regression with glmnet: assigning the Y value

Okay so I'm doing a LASSO regression but I'm having problems with my Y term.
I know my X has to be a matrix and the y's have to be numeric.
This is the case in my data set. However, I feel my model does not run properly. I will first show what I did and then what I think should be done (but I have no idea how to do it).
So what I did is as follows. I used the nuclear dataset from R for this example.
library(boot)
data("nuclear")
attach(nuclear)
nuclear <- as.matrix(nuclear)
So I converted it to a matrix, and then I used that same matrix for both x and y.
CV = cv.glmnet(x=nuclear,y=nuclear, family="multinomial", type.measure = "class", alpha = 1, nlambda = 100)
However, I feel my y is not correct; somehow my dependent variable should go there, but how do I get it there? Assume that nuclear$pt is my dependent variable. Putting nuclear$pt for y does not work.
plot(CV)
fit = glmnet(x=nuclear, y=nuclear, family = "multinomial" , alpha=1, lambda=CV$lambda.1se)
If I then run this, it feels as if my model didn't run at all. Probably something is wrong with my y, but I can't put my finger on it.
You used the same matrix for x and y. You have to separate the independent and dependent variables somehow. For example, you can use indices to select the variables:
cv.glmnet(x = nuclear[, 1:10], y = nuclear[, 11], family = "binomial",
          type.measure = "class", alpha = 1, nlambda = 100)
This will use the first 10 columns of nuclear as independent variables and the 11th column as dependent variable.
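Since the question names nuclear$pt (a 0/1 indicator) as the dependent variable, a sketch of that specific setup, using all remaining columns as predictors, could look like this:
library(boot)
library(glmnet)
data("nuclear")
x <- as.matrix(nuclear[, setdiff(names(nuclear), "pt")])  # predictors: everything except pt
y <- nuclear$pt                                           # binary response
cv_fit <- cv.glmnet(x = x, y = y, family = "binomial",
                    type.measure = "class", alpha = 1, nlambda = 100)
plot(cv_fit)
fit <- glmnet(x = x, y = y, family = "binomial",
              alpha = 1, lambda = cv_fit$lambda.1se)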

Bug in R e1071 Naive Bayes?

I have no experience in the R community, so please point me somewhere else if this is not the appropriate forum...
Long story short, I'm afraid that e1071::naiveBayes favors giving labels by alphabetical order.
In an earlier question here I had noticed some strange behavior with numerical predictors in the e1071 implementation of naive Bayes. While I got a more reasonable answer, some probabilities seemed biased upwards.
Can anyone shed any light on why this simulation would end up like this? I can only imagine that it is a bug at this point...
library(e1071)
# get a data frame with numObs rows, and numDistinctLabels possible labels
# each label is randomly drawn from letters a-z
# each label has its own distribution of a numeric variable
# this is normal(i*100, 10), i in 1:numDistinctLabels
# so, if labels are t, m, and q, t is normal(100, 10), m is normal(200, 10), etc
# the idea is that all labels should be predicted just as often
# but it seems that "a" will be predicted most, "b" second, etc
doExperiment = function(numObs, numDistinctLabels){
  possibleLabels = sample(letters, numDistinctLabels, replace = F)
  someFrame = data.frame(
    x = rep(NA, numObs),
    label = rep(NA, numObs)
  )
  numObsPerLabel = numObs / numDistinctLabels
  for(i in 1:length(possibleLabels)){
    label = possibleLabels[i]
    whichAreNA = which(is.na(someFrame$label))
    whichToSet = sample(whichAreNA, numObsPerLabel, replace = F)
    someFrame[whichToSet, "label"] = label
    someFrame[whichToSet, "x"] = rnorm(numObsPerLabel, 100 * i, 10)
  }
  someFrame = as.data.frame(unclass(someFrame))
  fit = e1071::naiveBayes(label ~ x, someFrame)
  # The threshold argument doesn't seem to change the matter...
  someFrame$predictions = predict(fit, someFrame, threshold = 0)
  someFrame
}
# given a labeled frame, return the label that was predicted most
getMostFrequentPrediction = function(labeledFrame){
  names(which.max(sort(table(labeledFrame$prediction))))
}
# run the experiment a few thousand times
mostPredictedClasses = sapply(1:2000,
                              function(x) getMostFrequentPrediction(doExperiment(100, 5)))
# make a bar chart of the most frequently predicted labels
plot(table(mostPredictedClasses))
This gives a plot like:
Giving every label the same normal distribution (i.e. mean 100, stdev 10) gives:
Regarding confusion in comment:
This is maybe getting away from Stack Overflow territory here, but anyways...
While I would expect classification to be less clumpy, the standard deviations do a lot to flatten out the pdfs, and if you run this enough you can observe that one or two labels actually tend to dominate (red and black in this case).
Too bad we can't exploit the knowledge that the standard deviation is the same for all of them.
If you add just a little noise to the mean it becomes much more evenly distributed, even though there's still some misclassification.
The problem is not naiveBayes, it's your getMostFrequentPrediction function. You are returning only one value even when there are ties for first place. Since you are using table(), the counts are implicitly sorted alphabetically by label, so when you grab the first maximum value it will also be the alphabetically "smallest" one. So if you run this a bunch of times:
getMostFrequentPrediction(data.frame(predictions=sample(rep(letters[1:3], 5))))
you will always get "a" even though the letters "a" "b" and "c" all appear 5 times.
If you want to randomly choose one of the most frequently predicted categories, here's another possible implementation:
getMostFrequentPrediction = function(labeledFrame){
  tt <- table(labeledFrame$predictions)
  # sample among the tied label names (sampling the counts themselves would
  # misbehave when a single number is passed to sample())
  sample(names(tt[tt == max(tt)]), 1)
}
This gives a much more even spread of most-frequently-predicted labels.
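As a quick check of the tie-aware version (a sketch reusing the tied example from above):
set.seed(1)
tied <- data.frame(predictions = sample(rep(letters[1:3], 5)))
table(replicate(1000, getMostFrequentPrediction(tied)))   # "a", "b", "c" roughly equally often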

coin::wilcox_test versus wilcox.test in R

In trying to figure out which one is better to use I have come across two issues.
1) The W statistic given by wilcox.test is different from that of coin::wilcox_test. Here's my output:
wilcox_test:
Exact Wilcoxon Mann-Whitney Rank Sum Test
data: data$variableX by data$group (yes, no)
Z = -0.7636, p-value = 0.4489
alternative hypothesis: true mu is not equal to 0
wilcox.test:
Wilcoxon rank sum test with continuity correction
data: data$variable by data$group
W = 677.5, p-value = 0.448
alternative hypothesis: true location shift is not equal to 0
I'm aware that there are actually two values for W and that the smaller one is usually reported. When wilcox.test is used with a comma instead of "~" I can get the other value, but this comes up as W = 834.5. From what I understand, coin::statistic() can return three different statistics using ("linear", "standardized", and "test"), where "linear" is the normal W and "standardized" is just the W converted to a z-score. None of these match up to the W I get from wilcox.test though (linear = 1055.5, standardized = 0.7636288, test = -0.7636288). Any ideas what's going on?
2) I like the options in wilcox_test for "distribution" and "ties.method", but it seems that you can not apply a continuity correction like in wilcox.test. Am I right?
I encountered the same issue when trying to apply Wendt formula to compute effect sizes using the coin package, and obtained aberrant r values due to the fact that the linear statistic outputted by wilcox_test() is unadjusted.
A great explanation is already given here, and therefore I will simply address how to obtain adjusted U statistics with the wilcox_test() function. Let's use the following data frame:
d <- data.frame(x = c(rnorm(n = 60, mean = 10, sd = 5), rnorm(n = 30, mean = 16, sd = 5)),
                g = c(rep("a", times = 60), rep("b", times = 30)))
We can perform identical tests with wilcox.test() and wilcox_test():
w1 <- wilcox.test( formula = x ~ g, data = d )
w2 <- wilcox_test( formula = x ~ g, data = d )
Which will output two distinct statistics:
> w1$statistic
W
321
> w2@statistic@linearstatistic
[1] 2151
The values are indeed totally different (although the tests are equivalent).
To obtain a U statistic identical to that of wilcox.test(), you need to subtract from wilcox_test()'s linear statistic the minimal value that the sum of the ranks of the reference sample can take, which is n1*(n1+1)/2.
Both commands take the first level in the factor of your grouping variable g as reference (which will by default be alphabetically ordered).
Then you can compute the smallest sum of the ranks possible for the reference sample:
n1 <- table(w2@statistic@x)[1]
And
w2@statistic@linearstatistic - n1*(n1+1)/2 == w1$statistic
should return TRUE
Voilà.
It seems that one is performing the Mann-Whitney U test and the other the Wilcoxon rank-sum test, which is defined in many different ways in the literature. They are pretty much equivalent; just look at the p-value. If you want the continuity correction in wilcox.test, just use the argument correct = TRUE.
Check https://stats.stackexchange.com/questions/79843/is-the-w-statistic-outputted-by-wilcox-test-in-r-the-same-as-the-u-statistic
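As for the continuity correction mentioned above, it is only available in wilcox.test(); a small sketch reusing d from the earlier answer:
wilcox.test(x ~ g, data = d, correct = TRUE)    # normal approximation with continuity correction (default)
wilcox.test(x ~ g, data = d, correct = FALSE)   # without the correction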

geom_smooth on a subset of data

Here is some data and a plot:
library(ggplot2)
set.seed(18)
data = data.frame(y = c(rep(0:1, 3), rnorm(18, mean = 0.5, sd = 0.1)),
                  colour = rep(1:2, 12),
                  x = rep(1:4, each = 6))
ggplot(data, aes(x = x, y = y, colour = factor(colour))) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x, se = F)
As you can see the linear regression is highly influenced by the values where x=1.
Can I get linear regressions calculated for x >= 2 but still display the values for x = 1 (where y equals either 0 or 1)?
The resulting graph would be exactly the same except for the linear regressions: they would not "suffer" from the influence of the values at x = 1.
It's as simple as geom_smooth(data=subset(data, x >= 2), ...). It's not important if this plot is just for yourself, but realize that something like this would be misleading to others if you don't mention how the regression was performed. I'd recommend changing the transparency of the excluded points.
ggplot(data, aes(x = x, y = y, colour = factor(colour))) +
  geom_point(data = subset(data, x >= 2)) +
  geom_point(data = subset(data, x < 2), alpha = .2) +
  geom_smooth(data = subset(data, x >= 2), method = 'lm', formula = y ~ x, se = F)
The regular lm function has a weights argument which you can use to assign a weight to a particular observation. In this way you can play with the influence each observation has on the outcome. I think this is a more general way of dealing with the problem than subsetting the data. Of course, assigning weights ad hoc does not bode well for the statistical soundness of the analysis; it is always best to have a rationale behind the weights, e.g. low-weight observations have a higher uncertainty.
I think under the hood ggplot2 uses the lm function so you should be able to pass the weights argument. You can add the weights through the aesthetic (aes), assuming that the weight is stored in a vector:
ggplot(data, aes(x = x, y = y, colour = factor(colour))) +
  geom_point() +
  stat_smooth(aes(weight = runif(nrow(data))), method = 'lm')
you could also put weight in a column in the dataset:
ggplot(data, aes(x = x, y = y, colour = factor(colour))) +
  geom_point() +
  stat_smooth(aes(weight = weight), method = 'lm')
where the column is called weight.
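For instance, a sketch that down-weights rather than drops the x == 1 points (the 0.1 weight is an arbitrary choice):
data$weight <- ifelse(data$x >= 2, 1, 0.1)   # small, non-zero influence for x == 1
ggplot(data, aes(x = x, y = y, colour = factor(colour))) +
  geom_point() +
  stat_smooth(aes(weight = weight), method = 'lm')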
I tried @Matthew Plourde's solution, but subset did not work for me; using the subset made no difference compared to the original data. I replaced subset with direct indexing and it worked:
ggplot(data, aes(x = x, y = y, colour = factor(colour))) +
  geom_point(data = data[data$x >= 2, ]) +
  geom_point(data = data[data$x < 2, ], alpha = .2) +
  geom_smooth(data = data[data$x >= 2, ], method = 'lm', formula = y ~ x, se = F)
