Implementing a GLM in R with weights.
I am wondering whether this quote from the R documentation for glm
Non-NULL weights can be used to indicate that different observations have different dispersions (with the values in weights being inversely proportional to the dispersions); or equivalently, when the elements of weights are positive integers, that each response is the mean of unit-weight observations.
means that the log-likelihood is modified in the following way
\sum_{i} \log f(X_i) \to \sum_{i} w_i \log f(X_i)
only when the weights are positive integers?
If this is not possible at all, how do I incorporate weights in the way described by the formula?
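One way to check this empirically for the integer-weight case (a sketch on simulated data, not a statement about glm's internals) is to compare a fit that uses integer weights against a fit on data where each row is replicated as many times as its weight. If the weighted log-likelihood is the one in the formula above, the coefficient estimates should coincide:
# simulated Poisson example: integer weights vs. explicitly replicated rows
set.seed(1)
n <- 50
d <- data.frame(x = rnorm(n))
d$y <- rpois(n, lambda = exp(0.5 + 0.8 * d$x))
d$w <- sample(1:3, n, replace = TRUE)   # positive integer weights
fit_w <- glm(y ~ x, family = poisson, data = d, weights = w)
d_rep <- d[rep(seq_len(n), d$w), ]      # replicate row i exactly w_i times
fit_rep <- glm(y ~ x, family = poisson, data = d_rep)
# the point estimates should agree, consistent with multiplying each
# observation's log-likelihood contribution by its weight
cbind(weighted = coef(fit_w), replicated = coef(fit_rep))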
I'm an R newbie. I'm using "clogitL1" to run a regularized conditional logistic regression for a matched case-control study with 1021 independent variables (metabolites). I'm not able to extract the regression coefficients. I've tried summary(x), coef(x), coefficient(x), x$beta - none of them work. I'm able to run it OK, and if I follow it with "cv.clogitL1" I can extract the cross-validated estimated coefficients, but not the estimated coefficients for the original model. Here's some of my code:
strata <- sort(data.meta$MATCHED_NEW)
condlog <- clogitL1(y = data.meta$BCR, x = data.meta$ln_metab[, data.features], strata,
                    numLambda = 100, minLambdaRatio = 0.000001, alpha = 1.0)
"strata" is a vector indicating pairing of cases & controls.
"data.meta$BCR" is a vector inicating case or control status
"data.meta$ln_metab" is a matrix with observations as rows and metabolite levels as columns
"data.features" is a vector indicating which metabolites passed several dimension reduction filters.
Appreciate any suggestions.
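Without the fitted object at hand I can't say which component holds the coefficient path, so the sketch below only uses base R to inspect the structure of the returned object; any component name it turns up (for example a coefficient matrix with one row per lambda value) is what you would then extract:
# inspect the fitted clogitL1 object to see where the coefficients live
names(condlog)                  # component names of the returned object
str(condlog, max.level = 1)     # compact overview of each component
# if str() shows a coefficient matrix (one row per lambda), extract it with
# condlog$<component_name> -- the exact name is whatever str() reports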
library(survival)
library(survminer)
library(dplyr)
ovarian <- ovarian  # copy the lazy-loaded data set into the workspace
ovarian$weighting <- sample(1:100, 26, replace = TRUE)
fitWEIGHT <- coxph(Surv(futime, fustat) ~ age + rx, data = ovarian, weights = weighting)
fitNOWEIGHT <- coxph(Surv(futime, fustat) ~ age + rx, data = ovarian)
In the example above, the R-squared for fitWEIGHT equals 1, while the same model without the fake sample weights has an R-squared of less than half that (0.5). Why is this happening?
Weighting here is effectively repeating the observations. You're generating weights with a uniform random sample, ovarian$weighting = sample(1:100, 26, replace = TRUE), spread across your underlying data set. Re-observing each set of data points according to these weights is likely biasing the fit toward near-perfect agreement between your dependent and independent variables. It's probably not perfectly correlated, but the 1:100 range likely pushes the value beyond the default number of significant digits, so it rounds to 1. If you change the sample to 1:10 or 40:50 or something similar, it would likely still inflate the correlation, but you would see an R-squared near 1 rather than the rounded-to-1 value you're seeing under the current weighting strategy.
For additional discussion of weights for this function, see the documentation linked below, to make sure the weights you're specifying are the type of weights you expect for this analysis. It's really weighting the observation count (i.e., a form of over/re-sampling the observation you assign the weight to). https://www.rdocumentation.org/packages/survival/versions/2.43-3/topics/coxph
Where it states:
Case weights: Case weights are treated as replication weights, i.e., a case weight of 2 is equivalent to having 2 copies of that subject's observation. When computers were much smaller, grouping like subjects together was a common trick used to conserve memory. Setting all weights to 2 for instance will give the same coefficient estimate but halve the variance. When the Efron approximation for ties (default) is employed, replication of the data will not give exactly the same coefficients as the weights option, and in this case the weighted fit is arguably the correct one.
When the model includes a cluster term or the robust=TRUE option, the computed variance treats any weights as sampling weights; setting all weights to 2 will in this case give the same variance as weights of 1.
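A small sketch of the behaviour described in that passage, reusing the ovarian data from the question: doubling every case weight should leave the coefficients unchanged while halving the variance of the estimates.
library(survival)
data(ovarian)
fit1 <- coxph(Surv(futime, fustat) ~ age + rx, data = ovarian)
fit2 <- coxph(Surv(futime, fustat) ~ age + rx, data = ovarian,
              weights = rep(2, nrow(ovarian)))
# same point estimates ...
cbind(unweighted = coef(fit1), doubled = coef(fit2))
# ... but the variances (squared standard errors) are halved, because a case
# weight of 2 is treated as two copies of that subject's observation
cbind(unweighted = diag(vcov(fit1)), doubled = diag(vcov(fit2)))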
I have the following data:
# actual value:
a <- c(26.77814,29.34224,10.39203,29.66659,20.79306,20.73860,22.71488,29.93678,10.14384,32.63233,24.82544,38.14778,25.12343,23.07767,14.60789)
# predicted value
p <- c(27.238142,27.492240,13.542026,32.266587,20.473063,20.508603,21.414882,28.536775,18.313844,32.082333,24.545438,30.877776,25.703430,22.397666,15.627892)
I already calculated MSE and RMSE for these two, but they're asking for AUC and ROC curve. How can I calculate it from this data using R? I thought AUC is for classification problems, was I mistaken? Can we still calculate AUC for numeric values like above?
Question:
I thought AUC is for classification problems, was I mistaken?
You are not mistaken. The area under the receiver operating characteristic curve can't be computed for two numeric vectors like the ones in your example. It's used to determine how well your binary classifier stands up to a gold standard binary classifier. You need a vector of cases vs. controls, or a threshold for the a vector that puts each value into one of two categories.
Here's an example of how you'd do this with the pROC package:
library(pROC)
# actual value
a <- c(26.77814,29.34224,10.39203,29.66659,20.79306,20.73860,22.71488,29.93678,10.14384,32.63233,24.82544,38.14778,25.12343,23.07767,14.60789)
# predicted value
p <- c(27.238142,27.492240,13.542026,32.266587,20.473063,20.508603,21.414882,28.536775,18.313844,32.082333,24.545438,30.877776,25.703430,22.397666,15.627892)
df <- data.frame(a = a, p = p)
# order the data frame according to the actual values
odf <- df[order(df$a),]
# convert the actual values to an ordered binary classification
odf$a <- odf$a > 12 # arbitrarily decided to use 12 as the threshold
# construct the roc object
roc_obj <- roc(odf$a, odf$p)
auc(roc_obj)
# Area under the curve: 0.9615
Here, we have arbitrarily decided that the threshold for the gold standard (a) is 12. If that's the case, then observations with a value lower than 12 are controls. The prediction (p) classifies very well, with an AUC of 0.9615. We don't have to decide on a threshold for our prediction classifier in order to determine the AUC, because it's independent of that threshold decision. We can slide it up and down depending on whether it's more important to find cases or to avoid misclassifying controls.
Important Note
I completely made up the threshold for the gold standard classifier. If you choose a different threshold (for the gold standard), you'll get a different AUC. For example, if we chose 28, the AUC would be 1. The AUC is independent of the threshold for the predictor, but absolutely depends on the threshold for the gold standard.
EDIT
To clarify the above note, which was apparently misunderstood: you were not mistaken. This kind of analysis is for classification problems. You cannot use it here without more information. In order to do it, you need a threshold for your a vector, which you don't have. You CAN'T make one up and expect to get a non-made-up result for the AUC. Because the AUC depends on the threshold for the gold standard classifier, if you just make up that threshold, as we did in the exercise above, you are also just making up the AUC.
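To make that dependence concrete, here is a small sketch (reusing a and p from above) that sweeps a few made-up gold-standard thresholds and reports the resulting AUC; the threshold of 12 reproduces the 0.9615 from above, and 28 gives an AUC of 1:
library(pROC)
# sweep several arbitrary gold-standard thresholds and watch the AUC change
thresholds <- c(12, 20, 25, 28)
sapply(thresholds, function(t) {
  labels <- a > t               # made-up binary gold standard at threshold t
  as.numeric(auc(roc(labels, p)))
})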
I would really appreciate any help with specifying probability weights in R without using Lumley's survey package. I am conducting mediation analysis in R using the Imai et al. mediation package, which does not currently support svyglm.
The code I am currently running is:
olsmediator_basic <- lm(poledu ~ gateway_strict_alt + gender_n + spline1 + spline2 + spline3,
                        data = unifiedanalysis, weights = designweight)
However, I'm unsure if this is weighting the data correctly. The reason is that this code yields standard errors that differ from those I am getting in Stata. The Stata code I am running is:
reg poledu gateway_strict_alt gender_n spline1 spline2 spline3 [pweight=designweight]
I was wondering if the weights option in R may not be for inverse probability weights, but I was unable to determine this from the documentation, this forum or elsewhere. If I am missing something, I really apologize - I am new to R as well as to this forum.
Thank you in advance for your help.
The R documentation specifies that the weights argument of the lm function takes values inversely proportional to the variance of the observations. This is the definition of analytic weights, or aweights in Stata.
Have a look at the ipw package for inverse probability weighting.
To correct a previous answer: I looked up the manual on weights and found the following description for weights in lm:
Non-NULL weights can be used to indicate that different observations have different variances (with the values in weights being inversely proportional to the variances); or equivalently, when the elements of weights are positive integers w_i, that each response y_i is the mean of w_i unit-weight observations (including the case that there are w_i observations equal to y_i and the data have been summarized).
These are actually frequency weights (fweights in Stata). They multiply out each observation the number of times given by the weight vector. Probability weights, on the other hand, refer to the probability that an observation is included in the sample from the population. Using them adjusts the impact of the observation on the coefficients, but not on the standard errors, as they don't change the number of observations represented in the sample.
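If the goal is to come closer to Stata's [pweight=...] standard errors without the survey package, one common approach is to keep the weighted lm fit but compute heteroskedasticity-robust standard errors for it. This is a sketch that assumes the sandwich and lmtest packages are installed and reuses the variable names from the question:
library(sandwich)
library(lmtest)
# weighted fit as in the question
fit <- lm(poledu ~ gateway_strict_alt + gender_n + spline1 + spline2 + spline3,
          data = unifiedanalysis, weights = designweight)
# robust (HC1) standard errors; these are typically much closer to Stata's
# pweight standard errors than the default lm output
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))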