How do I calculate AUC from two continuous variables in R?

I have the following data:
# actual value:
a <- c(26.77814,29.34224,10.39203,29.66659,20.79306,20.73860,22.71488,29.93678,10.14384,32.63233,24.82544,38.14778,25.12343,23.07767,14.60789)
# predicted value
p <- c(27.238142,27.492240,13.542026,32.266587,20.473063,20.508603,21.414882,28.536775,18.313844,32.082333,24.545438,30.877776,25.703430,22.397666,15.627892)
I already calculated MSE and RMSE for these two, but they're asking for AUC and ROC curve. How can I calculate it from this data using R? I thought AUC is for classification problems, was I mistaken? Can we still calculate AUC for numeric values like above?

Question:
I thought AUC is for classification problems, was I mistaken?
You are not mistaken. The area under the receiver operating characteristic (ROC) curve can't be computed for two numeric vectors like the ones in your example. It's used to determine how well your binary classifier stands up to a gold standard binary classification. You need a vector of cases vs. controls, or levels for the a vector that put each value in one of two categories.
Here's an example of how you'd do this with the pROC package:
library(pROC)
# actual value
a <- c(26.77814,29.34224,10.39203,29.66659,20.79306,20.73860,22.71488,29.93678,10.14384,32.63233,24.82544,38.14778,25.12343,23.07767,14.60789)
# predicted value
p <- c(27.238142,27.492240,13.542026,32.266587,20.473063,20.508603,21.414882,28.536775,18.313844,32.082333,24.545438,30.877776,25.703430,22.397666,15.627892)
df <- data.frame(a = a, p = p)
# order the data frame according to the actual values
odf <- df[order(df$a),]
# convert the actual values to an ordered binary classification
odf$a <- odf$a > 12 # arbitrarily decided to use 12 as the threshold
# construct the roc object
roc_obj <- roc(odf$a, odf$p)
auc(roc_obj)
# Area under the curve: 0.9615
Here, we have arbitrarily decided that the threshold for the gold standard (a) is 12. If that's the case, then observations that have a value lower than 12 are controls. The prediction (p) classifies very well, with an AUC of 0.9615. We don't have to decide on a threshold for our prediction classifier in order to determine the AUC, because the AUC is independent of that threshold decision. We can slide the prediction threshold up and down depending on whether it's more important to find cases or to avoid misclassifying a control.
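To make the "slide up and down" idea concrete, here is a minimal base-R sketch that tabulates sensitivity and specificity at a few cutoffs on p, still assuming the made-up gold standard threshold of 12. The helper name sens_spec and the three cutoffs are illustrative choices of mine, not part of pROC.

```r
# actual and predicted values from the question
a <- c(26.77814,29.34224,10.39203,29.66659,20.79306,20.73860,22.71488,29.93678,
       10.14384,32.63233,24.82544,38.14778,25.12343,23.07767,14.60789)
p <- c(27.238142,27.492240,13.542026,32.266587,20.473063,20.508603,21.414882,
       28.536775,18.313844,32.082333,24.545438,30.877776,25.703430,22.397666,15.627892)

is_case <- a > 12  # arbitrary gold standard threshold, as in the answer

# hypothetical helper: call an observation a case when its score exceeds the cutoff
sens_spec <- function(cutoff, scores, is_case) {
  called_case <- scores > cutoff
  c(cutoff      = cutoff,
    sensitivity = mean(called_case[is_case]),    # share of true cases found
    specificity = mean(!called_case[!is_case]))  # share of controls left alone
}

tradeoff <- t(sapply(c(15, 19, 25), sens_spec, scores = p, is_case = is_case))
print(tradeoff)
```

A low cutoff finds every case but misclassifies one of the two controls. A higher cutoff protects the controls at the cost of missing some cases.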
Important Note
I completely made up the threshold for the gold standard classifier. If you choose a different threshold (for the gold standard), you'll get a different AUC. For example, if we chose 28, the AUC would be 1. The AUC is independent of the threshold for the predictor, but absolutely depends on the threshold for the gold standard.
EDIT
To clarify the above note, which was apparently misunderstood, you were not mistaken. This kind of analysis is for classification problems. You cannot use it here without more information. In order to do it, you need a threshold for your a vector, which you don't have. You CAN'T make one up and expect to get a non made up result for the AUC. Because the AUC depends on the threshold for the gold standard classifier, if you just make up the threshold, as we did in the exercise above, you are also just making up the AUC.
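As a rough illustration of how much the AUC moves with the made-up gold standard threshold, the sketch below recomputes it for several thresholds using a rank-based (Mann-Whitney) formula in base R. The function auc_rank is my own helper, and it assumes that higher predictions indicate cases; pROC picks the direction automatically, so results can differ when the predictor runs the other way.

```r
# actual and predicted values from the question
a <- c(26.77814,29.34224,10.39203,29.66659,20.79306,20.73860,22.71488,29.93678,
       10.14384,32.63233,24.82544,38.14778,25.12343,23.07767,14.60789)
p <- c(27.238142,27.492240,13.542026,32.266587,20.473063,20.508603,21.414882,
       28.536775,18.313844,32.082333,24.545438,30.877776,25.703430,22.397666,15.627892)

# hypothetical helper: AUC as the probability that a random case
# scores higher than a random control (Mann-Whitney U / (n1 * n0))
auc_rank <- function(is_case, scores) {
  n1 <- sum(is_case)
  n0 <- sum(!is_case)
  (sum(rank(scores)[is_case]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

gold_thresholds <- c(12, 20, 25, 28)  # all equally arbitrary
auc_by_threshold <- sapply(gold_thresholds, function(t) auc_rank(a > t, p))
print(data.frame(gold_threshold = gold_thresholds, auc = auc_by_threshold))
```

Thresholds of 12 and 28 reproduce the 0.9615 and 1 quoted above. The point is that each row is a different classification problem, so none of these numbers describes the regression itself.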

Related

Is there a way to change threshold of a classification within a model in caret R?

I would like to change the threshold of the model and have come across posts like the Cross Validated thread How to change threshold for classification in R randomForests?
If I change the threshold after creating the model, that means I will have to tweak things again for test data or new data.
Is there a way in R & caret to change the threshold within the model so that I can run the same model with same threshold value on new data or test data as well?
In probabilistic classifiers, such as Random Forests, there is no threshold involved during the fitting of a model, nor is there any threshold associated with a fitted model; hence, there is actually nothing to change. As correctly pointed out in the CV thread Reduce Classification Probability Threshold:
Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
Quoting from my own answer in Change threshold value for Random Forest classifier :
There is simply no threshold during model training; Random Forest is a probabilistic classifier, and it only outputs class probabilities. "Hard" classes (i.e. 0/1), which indeed require a threshold, are neither produced nor used in any stage of the model training - only during prediction, and even then only in the cases we indeed require a hard classification (not always the case). Please see Predict classes or class probabilities? for more details.
So, if you produce predictions from a fitted model, say rf, with the argument type = "prob", as shown in the CV thread you have linked to:
pred <- predict(rf, mydata, type = "prob")
these predictions will be probability values in [0, 1], and not hard classes 0/1. From here, you are free to choose the threshold as shown in the answer there, i.e.:
thresh <- 0.6 # any desired value in [0, 1]
class_pred <- c()
class_pred[pred <= thresh] <- 0
class_pred[pred > thresh] <- 1
or of course experiment with different values of threshold without needing to change anything in the model itself.
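If the goal is to reuse one threshold on test and new data, one option is to bundle the fitted model and the threshold in a small function. The sketch below is only an illustration: it uses a base-R logistic regression on mtcars as a stand-in for rf (so that it runs without randomForest or caret), and make_classifier is a hypothetical helper name. With caret, you would swap the predict() call for predict(model, newdata, type = "prob") and pick the column of the positive class.

```r
# stand-in data and model (assumption: any probabilistic classifier works the same way)
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]
fit   <- glm(am ~ wt, data = train, family = binomial)

# hypothetical helper: returns a function that always applies the same threshold
make_classifier <- function(model, thresh) {
  force(model)
  force(thresh)
  function(newdata) {
    prob <- predict(model, newdata = newdata, type = "response")
    as.integer(prob > thresh)  # 1 if the probability exceeds the threshold, else 0
  }
}

classify <- make_classifier(fit, thresh = 0.6)
class_pred_test <- classify(test)  # same threshold on any new data
print(class_pred_test)
```

The model itself is untouched; only the decision rule wrapped around it carries the threshold.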

Why does survey weight change R SQUARED?

library(survival)
library(survminer)
library(dplyr)
ovarian=ovarian
ovarian$weighting = sample(1:100,26,replace=T)
fitWEIGHT <- coxph(Surv(futime, fustat) ~ age + rx,data=ovarian,weights=weighting)
fitNOWEIGHT <- coxph(Surv(futime, fustat) ~ age + rx,data=ovarian)
In the example above, the R-squared for fitWEIGHT equals 1. However, the same model without the fake sample weights has an R-squared of less than 0.5. Why is this happening?
Weighting here is effectively repeating the observations. You're drawing the weights as a uniform random sample, ovarian$weighting = sample(1:100,26,replace=T), that's spread across your underlying data set. Re-observing each data point as many times as its weight says is likely inflating the apparent strength of the association between your dependent and independent variables. It's probably not exactly perfect, but the 1:100 range is likely blowing it out beyond the default number of significant digits, so it rounds to 1. If you change the sample to a range of smaller weights such as 1:10, it would likely still inflate the R-squared, but to a nearly-1 value instead of the rounded-to-1 value that you're seeing now under the current weighting strategy.
For additional discussion of weights for this function, see the documentation below, so you can make sure that the weights you're specifying are the type of weights you're expecting for this analysis. coxph really weights the observation count (i.e., a form of over/re-sampling of the observation you're assigning the weight to). https://www.rdocumentation.org/packages/survival/versions/2.43-3/topics/coxph
Where it states:
Case Weights Case weights are treated as replication weights, i.e., a
case weight of 2 is equivalent to having 2 copies of that subject's
observation. When computers were much smaller grouping like subjects
together was a common trick to used to conserve memory. Setting all
weights to 2 for instance will give the same coefficient estimate but
halve the variance. When the Efron approximation for ties (default) is
employed replication of the data will not give exactly the same
coefficients as the weights option, and in this case the weighted fit
is arguably the correct one.
When the model includes a cluster term or the robust=TRUE option the
computed variance treats any weights as sampling weights; setting all
weights to 2 will in this case give the same variance as weights of 1.
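The replication behaviour quoted above can be checked directly. The sketch below, which assumes the survival package and its ovarian data, fits the same model with all weights equal to 1 and all weights equal to 2, then compares coefficients, variances and the likelihood-ratio statistic. The column names w1 and w2 and the helper lrt are mine.

```r
library(survival)

ov <- survival::ovarian
ov$w1 <- 1  # every subject counted once
ov$w2 <- 2  # every subject counted twice

fit1 <- coxph(Surv(futime, fustat) ~ age + rx, data = ov, weights = w1)
fit2 <- coxph(Surv(futime, fustat) ~ age + rx, data = ov, weights = w2)

# likelihood-ratio statistic: 2 * (final log-likelihood - null log-likelihood)
lrt <- function(fit) 2 * diff(fit$loglik)

print(rbind(weights_1 = coef(fit1), weights_2 = coef(fit2)))  # same estimates
print(diag(vcov(fit2)) / diag(vcov(fit1)))                    # variances halved
print(lrt(fit2) / lrt(fit1))                                  # test statistic doubled
```

Doubling the weights doubles the evidence the model thinks it has while the number of rows stays at 26. If the reported R-squared is derived from the likelihood-ratio statistic divided by the unweighted number of rows (my understanding of older summary.coxph output, so treat it as an assumption to check against your installed version), weights averaging around 50 would push it to 1 in the same way.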

R: Limit/Set values of predicted results from linear model

New to R.
Looking to limit the range of values that can be predicted.
df.Train <- data.frame(S=c(1,2,2,2,1),L=c(1,2,3,3,1),M=c(400,450,400,700,795),V=c(423,400,555,600,800),G=c(4,3.2,2,2.7,3.4), stringsAsFactors=FALSE)
m.Train <- lm(G~S+L+M+V,data=df.Train)
df.Test <- data.frame(S=c(1,2,1,2,1),L=c(1,2,3,1,1),M=c(400,450,500,800,795),V=c(423,475,555,600,555), stringsAsFactors=FALSE)
round(predict(m.Train, df.Test, type="response"),digits=1)
#seq(0,4,.1) #Predicted values should fall in this range
I've experimented with the predict() options but no luck.
Is there an option in predict? Should I be limiting it in the model?
Thank you
There are ways to transform your response variable (G in this case), but there needs to be a good reason to do this. For example, if you want the output to be probabilities between 0 and 1 and your response variable is binary (0,1), then you need a logistic regression.
It all comes down to what data you have and whether a model / transformation of the response variable would be appropriate. In your example you do not specify what the data is and therefore we cannot say anything about which model or which transformation to use.
Setting the above aside, if you really care about the prediction and do not care about the model or the transformation (but why wouldn't you care?), it looks like your data could use a quasi-Poisson generalised linear model, which might provide the output you need:
df.Train <- data.frame(S=c(1,2,2,2,1),L=c(1,2,3,3,1),M=c(400,450,400,700,795),V=c(423,400,555,600,800),G=c(4,3.2,2,2.7,3.4), stringsAsFactors=FALSE)
m.Train <- glm(G~S+L+M+V,data=df.Train, family=quasipoisson)
df.Test <- data.frame(S=c(1,2,1,2,1),L=c(1,2,3,1,1),M=c(400,450,500,800,795),V=c(423,475,555,600,555), stringsAsFactors=FALSE)
> predict(m.Train, df.Test, type="response")
1 2 3 4 5
4.000000 2.840834 3.062754 3.615447 4.573276
#probably not as good as you want
The model uses a log link by default, which ensures the predicted values will be positive. There is no guarantee that the model will not predict values greater than 4, but since you fed it values of at most 4 (your G variable), chances are that most of the predictions will follow that distribution. In this example one prediction already exceeds 4, so you might need to consider how to treat predictions that go above 4.
In general you should consider carefully which model and which response transformation to choose. The Poisson model above, for example, is usually used for count data. However, you should never manipulate predictions on your own, so if you choose the lm model in the end, make sure you use the predictions it gives.
EDIT
It looks like a non-linear regression might be what you need in your case. The problem with a linear model like lm is that predictions can be greater than the max of the observed cases or less than the min of the observed cases, in which case a linear regression might not be appropriate. There are algorithms that will never predict a value greater than the max or less than the min, and one of those might be better suited to your case. One of these algorithms is k-nearest neighbors, for example:
library(FNN)
> knn.reg(df.Train[1:4], test=df.Test[1:4], y=df.Train[5], k=3)
Prediction:
[1] 3.066667 3.066667 3.066667 2.700000 3.100000
As you can see, the predictions will never go above 4. That said, knn is a local algorithm, so again you need to research whether this is a good approach for your problem and your data. In terms of predictions, though, it definitely conforms to your conditions. Knn is a very easy-to-understand algorithm that relies on distances between points to calculate predictions.
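For comparison, a quick way to see whether the original lm extrapolates beyond the observed response is to flag predictions that fall outside the range of G in the training data. This is only a diagnostic sketch on the data from the question; the names obs_range and outside are mine, and it does not alter the predictions.

```r
df.Train <- data.frame(S=c(1,2,2,2,1), L=c(1,2,3,3,1), M=c(400,450,400,700,795),
                       V=c(423,400,555,600,800), G=c(4,3.2,2,2.7,3.4))
df.Test  <- data.frame(S=c(1,2,1,2,1), L=c(1,2,3,1,1), M=c(400,450,500,800,795),
                       V=c(423,475,555,600,555))

m.Train <- lm(G ~ S + L + M + V, data = df.Train)
pred    <- predict(m.Train, df.Test)

# flag predictions below the smallest or above the largest observed G
obs_range <- range(df.Train$G)
outside   <- pred < obs_range[1] | pred > obs_range[2]

print(data.frame(pred = round(pred, 1), outside_observed_range = outside))
```

Note that with five observations and five coefficients this lm fits the training data exactly, so any flagged rows say more about the tiny sample than about the method.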
Hope it helps :)

Variable sample size per cluster/group in mixed effects logistic regression

I am attempting to run mixed effects logistic regression models, yet am concerned about the variable sample sizes in each cluster/group, and also the very low number of "successes" in some models.
I have ~ 700 trees distributed across 163 field plots (i.e., the cluster/group), visited annually from 2004-11. I am fitting separate mixed effects logistic regression models (hereafter GLMMs) for each year of the study to compare this output to inference from a shared frailty model (i.e., survival analysis with random effect).
The number of trees per plot varies from 1-22. Also, some years have a very low number of "successes" (i.e., diseased trees). For example, in 2011 there were only 4 successes out of 694 "failures" (i.e., healthy trees).
My questions are: (1) is there a general rule for the ideal number of samples|group when the inference focus is only on estimating the fixed effects in the GLMM, and (2) are GLMMs stable when there is such an extreme difference in the ratio of successes:failures.
Thank you for any advice or suggestions of sources.
-Sarah
(Hi, Sarah, sorry I didn't answer previously via e-mail ...)
It's hard to answer these questions in general -- you're stuck
with your data, right? So it's not a question of power analysis.
If you want to make sure that your results will be reasonably
reliable, probably the best thing to do is to run some simulations.
I'm going to show off a fairly recent feature of lme4 (in the
development version 1.1-1, on Github), which is to simulate
data from a GLMM given a formula and a set of parameters.
First I have to simulate the predictor variables (you wouldn't
have to do this, since you already have the data -- although
you might want to try varying the range of number of plots,
trees per plot, etc.).
set.seed(101)
## simulate number of trees per plot
## want mean of 700/163=4.3 trees, range=1-22
## by trial and error this is about right
r1 <- rnbinom(163,mu=3.3,size=2)+1
## generate plots and trees within plots
d <- data.frame(plot=factor(rep(1:163,r1)),
tree=factor(unlist(lapply(r1,seq))))
## expand by year
library(plyr)
d2 <- ddply(d,c("plot","tree"),
transform,year=factor(2004:2011))
Now set up the parameters: I'm going to assume year is a fixed
effect and that overall disease incidence is plogis(-2)=0.12 except
in 2011 when it is plogis(-2-3)=0.0067. The among-plot standard deviation
is 1 (on the logit scale), as is the among-tree-within-plot standard
deviation:
beta <- c(-2,0,0,0,0,0,0,-3)
theta <- c(1,1) ## sd by plot and plot:tree
Now simulate: year as fixed effect, plot and tree-within-plot as
random effects
library(lme4)
s1 <- simulate(~year+(1|plot/tree),family=binomial,
newdata=d2,newparams=list(beta=beta,theta=theta))
d2$diseased <- s1[[1]]
Summarize/check:
d2sum <- ddply(d2,c("year","plot"),
summarise,
n=length(tree),
nDis=sum(diseased),
propDis=nDis/n)
library(ggplot2)
library(Hmisc) ## for mean_cl_boot
theme_set(theme_bw())
ggplot(d2sum,aes(x=year,y=propDis))+geom_point(aes(size=n),alpha=0.3)+
stat_summary(fun.data=mean_cl_boot,colour="red")
Now fit the model:
g1 <- glmer(diseased~year+(1|plot/tree),family=binomial,
data=d2)
fixef(g1)
You can try this many times and see how often the results are reliable ...
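Before refitting the GLMM many times, it can help to see how few diseased trees the assumed parameters produce per year. The sketch below repeats only the data-generating step by hand in base R (no lme4), using the same assumptions as above: logit incidence of -2, dropping to -5 in 2011, and standard deviations of 1 for plot and tree effects. The function simulate_once and the choice of 200 replicates are mine.

```r
set.seed(101)
n_plot <- 163
trees_per_plot <- rnbinom(n_plot, mu = 3.3, size = 2) + 1  # as in the answer
plot_id <- rep(seq_len(n_plot), trees_per_plot)            # plot of each tree
n_tree  <- length(plot_id)
years   <- 2004:2011
beta_year <- c(-2, -2, -2, -2, -2, -2, -2, -5)  # logit incidence per year

# hypothetical helper: one simulated data set, returning diseased trees per year
simulate_once <- function() {
  u_plot <- rnorm(n_plot, sd = 1)  # among-plot random effect
  u_tree <- rnorm(n_tree, sd = 1)  # among-tree-within-plot random effect
  eta <- outer(u_plot[plot_id] + u_tree, beta_year, "+")  # trees x years
  y <- matrix(rbinom(length(eta), size = 1, prob = plogis(eta)), nrow = n_tree)
  colSums(y)
}

n_dis <- replicate(200, simulate_once())  # years x replicates
rownames(n_dis) <- years
print(apply(n_dis, 1, quantile, probs = c(0.025, 0.5, 0.975)))
```

If the simulated 2011 counts are routinely in the single digits, that is a hint that the year-specific fit will rest on very little information, whatever the fitting software reports.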
As Josh said, this is a better question for CrossValidated.
There are no hard and fast rules for logistic regression, but one rule of thumb is that 10 successes and 10 failures are needed per cell in the design (cluster in this case), times the number of continuous variables in the model.
In your case, I would think the model, if it converges, would be unstable. You can examine that by bootstrapping the errors of the estimates of the fixed effects.

estimating density in a multidimensional space with R

I have two types of individuals, say M and F, each described with six variables (forming a 6D space S). I would like to identify the regions in S where the densities of M and F differ maximally. I first tried a binomial logistic model linking F/M to the six variables, but the result of this GLM is very hard to interpret (in part due to the numerous significant interaction terms). Thus I am thinking of a “spatial” analysis where I would separately estimate the density of M and F individuals everywhere in S, then calculate the difference in densities. Eventually I would manually look for the largest difference in densities, and extract the values of the 6 variables there.
I found the function sm.density in the package sm that can estimate densities in a 3D space, but I find nothing for a space with n>3. Would you know of something that would manage to do this in R? Alternatively, would you have a more elegant method to answer my first question (2nd sentence)?
In advance,
Thanks a lot for your help
The function kde of the package ks performs kernel density estimation for multivariate data with dimensions ranging from 1 to 6.
The pdfCluster and np packages provide functions to perform kernel density estimation in higher dimensions.
If you prefer parametric techniques, you can look at R packages doing Gaussian mixture estimation, like mclust or mixtools.
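To show the "estimate both densities, then take the difference" idea without depending on any of those packages, here is a hand-rolled product-Gaussian kernel estimator in base R. Everything in it is an assumption for illustration: the data are simulated stand-ins for M and F, kde_at is my own helper, and the per-dimension bw.nrd bandwidths are a crude 1D rule of thumb that a dedicated package such as ks would improve on.

```r
set.seed(1)
d   <- 6
x_m <- matrix(rnorm(200 * d), ncol = d)              # stand-in for the M individuals
x_f <- matrix(rnorm(200 * d, mean = 0.5), ncol = d)  # stand-in for the F individuals

# hypothetical helper: product-Gaussian KDE of `train`, evaluated at the rows of `pts`
kde_at <- function(train, pts, h = apply(train, 2, bw.nrd)) {
  apply(pts, 1, function(x) {
    z <- sweep(sweep(train, 2, x, "-"), 2, h, "/")  # standardised distances
    mean(exp(rowSums(dnorm(z, log = TRUE)))) / prod(h)
  })
}

pts       <- rbind(x_m, x_f)  # evaluate at the observed points, not on a 6D grid
dens_diff <- kde_at(x_f, pts) - kde_at(x_m, pts)

# observed point where the two estimated densities differ most
print(pts[which.max(abs(dens_diff)), ])
```

Evaluating at the observed points sidesteps building a grid in six dimensions, which would need a million cells even at 10 points per axis.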
The ability to do this with GLM models may be constrained both by the interpretability issues that you already encountered and by numerical stability issues. Furthermore, you don't describe the GLM models, so it's not possible to see whether you considered non-linearity. If you have lots of data, you might consider using 2D crossed spline terms. (These are not really density estimates.) If I were doing initial exploration with the facilities in the rms/Hmisc packages in five dimensions, it might look like:
library(rms)
dd <- datadist(dat)
options(datadist="dd")
big.mod <- lrm( MF ~ ( rcs(var1, 3) + # `lrm` is logistic regression in rms
rcs(var2, 3) +
rcs(var3, 3) +
rcs(var4, 3) +
rcs(var5, 3) )^2,# all 2way interactions
data=dat,
max.iter=50) # these fits may take longer times
bplot( Predict(big.mod, var1,var2, n=10) )
That should show the simultaneous functional form of var1's and var2's contribution to the "5 dimensional" model estimates at 10 points each and at the median value of the three other variables.

Resources