R multiclass/multinomial classification ROC using multiclass.roc (Package 'pROC')

I am having difficulty understanding what the multiclass.roc parameters should look like.
Here is a snapshot of my data:
> head(testing.logist$cut.rank)
[1] 3 3 3 3 1 3
Levels: 1 2 3
> head(mnm.predict.test.probs)
1 2 3
9 1.013755e-04 3.713862e-02 0.96276001
10 1.904435e-11 3.153587e-02 0.96846413
12 6.445101e-23 1.119782e-11 1.00000000
13 1.238355e-04 2.882145e-02 0.97105472
22 9.027254e-01 7.259787e-07 0.09727389
26 1.365667e-01 4.034372e-01 0.45999610
>
I tried calling multiclass.roc with:
multiclass.roc(
response=testing.logist$cut.rank,
predictor=mnm.predict.test.probs,
formula=response~predictor
)
but naturally I get an error:
Error in roc.default(response, predictor, levels = X, percent = percent, :
Predictor must be numeric or ordered.
When it's a binary classification problem I know that 'predictor' should contain probabilities (one per observation). However, in my case I have 3 classes, so my predictor is a set of rows that each have 3 columns (or a sublist of 3 values) corresponding to the probability of each class.
Does anyone know what my 'predictor' should look like, rather than what it currently looks like?

The pROC package is not really designed to handle this case where you get multiple predictions (as probabilities for each class). Typically you would assess P(class = 1) on its own:
multiclass.roc(
response=testing.logist$cut.rank,
predictor=mnm.predict.test.probs[,1])
And then do it again with P(class = 2) and P(class = 3). Or better, determine the most likely class:
predicted.class <- apply(mnm.predict.test.probs, 1, which.max)
multiclass.roc(
response=testing.logist$cut.rank,
predictor=predicted.class)
Consider multiclass.roc as a toy that can sometimes be helpful but most likely won't really fit your needs.
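If you do want a per-class picture anyway, here is a minimal one-vs-rest sketch (my own illustration, assuming the probability matrix columns are named after the factor levels, as in the head() output above):
library(pROC)
## One binary ROC/AUC per class: class k vs. the rest, scored with P(class = k)
aucs <- sapply(levels(testing.logist$cut.rank), function(cl) {
  r <- roc(response  = as.integer(testing.logist$cut.rank == cl),
           predictor = mnm.predict.test.probs[, cl])
  as.numeric(auc(r))
})
aucs
Each entry of aucs is then the AUC of "class k vs. the rest" scored with that class's predicted probability.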

Related

interpolate missing values with respect to step column in R

I have panel data that is designed for a survival model.
Some observations have missing values; however, the intervals are not constant.
Here is an example of it:
 t  value
 5      5
10      8
15     12
18     NA
20      3
25      9
30     15
35     21
As you can see, the t intervals are 5 units. However, there is a record where t is 18 and its value is missing. I want to interpolate the value column with respect to the column t in R.
Do you have any suggestions?
It would be better if the method can support non-linear interpolation.
P.S.
The data is relatively large, so generating a panel with small steps is not possible with my hardware.
I can handle Python as well, but R is more convenient since the interpolation happens mid-analysis in R.
In R
Use na_interpolation and pass the desired parameters, such as
library(imputeTS)
df$value <- na_interpolation(df$value, option = "spline")
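As far as I can tell, na_interpolation interpolates over the observation index rather than over t. If the interpolation should follow the unevenly spaced t column itself, a small base-R sketch (using the df columns from the question) would be:
# Non-linear interpolation that respects the spacing of the t column
miss <- is.na(df$value)
df$value[miss] <- spline(x = df$t[!miss], y = df$value[!miss],
                         xout = df$t[miss])$y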
In Python
Use pandas.Series.interpolate and pass various methods as follows, such as quadratic
import pandas as pd
df['value'] = df['value'].interpolate(method = 'quadratic', limit_direction = 'both')
Use sklearn.impute.KNNImputer, such as
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors = 2, weights = 'distance')
df['value'] = imputer.fit_transform(df[['value']])

inputs of 2 separate predict() return the same set of fitted values

Confession: I attempted to ask this question yesterday, but used a sample dataset that resembles my "real" data, in the hope this would be more convenient for readers here. One issue was resolved, but another remains that appears immutable.
My objective is creating a linear model of two predicted vectors, "yC.hat" and "yT.hat", which are meant to project average effects for unique observed values of pri2000v as a function of the average poverty level I(avgpoverty^2) under control (treatment = 0) and treatment (treatment = 1) conditions.
While I appear to have no issues running the regression itself, the inputs of my data argument have no effect on predict(), and only the object itself affects the output. As a result, treatment = 0 and treatment = 1 in the data argument produce the same fitted values. In fact, I can plug ANY value into the data argument and it makes no difference. So I suspect my failure to understand the issue starts here.
Here is my code:
q6rega <- lm(pri2000v ~ treatment + I(log(pobtot1994)) + I(avgpoverty^2)
#interactions
+ treatment:avgpoverty + treatment:I(avgpoverty^2), data = pga)
## predicted PRI support under the Treatment condition
q6.yT.hat <- predict(q6rega,
data = data.frame(I(avgpoverty^2) = 9:25, treatment = 1))
## predicted PRI support rate under the Control condition
q6.yC.hat <- predict(q6rega,
data = data.frame(I(avgpoverty^2) = 9:25, treatment = 0))
q6.yC.hat == q6.yT.hat
TRUE (for all 417 observations)
dput(pga) has been posted on my GitHub, if needed.
EDIT: There were a few things wrong with my code above, but not specifying pobtot1994 somehow resulted in R treating it as if newdata had been omitted. Since I'm fairly new to statistics, I confused fitted values with the prediction output I was actually trying to achieve. I would have expected an unexpected input to produce an error instead.
I'm surprised you were able to run a prediction at all when the new data frame is lacking a variable (pobtot1994) that your model requires. Note, too, that predict.lm() has no data argument, only newdata, so your data = ... argument was silently ignored and the original fitted values were returned.
Anyway, you need to create a new data frame containing the three variables used in the model, in untransformed form. Since you are interested in comparing the fitted values of avgpoverty 3 to 5 for treatment 1 and 0, you have to hold the third variable, pobtot1994, constant. I use the mean of pobtot1994 here for simplicity.
newdat <- expand.grid(avgpoverty=3:5, treatment=factor(c(0,1)), pobtot1994=mean(pga$pobtot1994))
avgpoverty treatment pobtot1994
1 3 0 2037.384
2 4 0 2037.384
3 5 0 2037.384
4 3 1 2037.384
5 4 1 2037.384
6 5 1 2037.384
The prediction will show you the different values for the two conditions.
newdat$fitted <- predict(q6rega, newdata=newdat)
avgpoverty treatment pobtot1994 fitted
1 3 0 2037.384 38.86817
2 4 0 2037.384 50.77476
3 5 0 2037.384 55.67832
4 3 1 2037.384 51.55077
5 4 1 2037.384 49.03148
6 5 1 2037.384 59.73910
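As an aside, here is a tiny self-contained sketch (with made-up toy data, not your pga data) of the behaviour noted above: an unrecognised data argument is swallowed by predict.lm()'s ... and ignored, so the original fitted values come back.
toy <- data.frame(x = 1:10, y = (1:10) + rnorm(10))
m <- lm(y ~ x, data = toy)
all.equal(predict(m, data = data.frame(x = 100)), fitted(m))  # TRUE: 'data' is ignored
predict(m, newdata = data.frame(x = 100))                     # an actual prediction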

Adjusted survival curve based on weighted Cox regression

I'm trying to make an adjusted survival curve based on a weighted Cox regression performed on a case-cohort data set in R, but unfortunately I can't make it work. I was therefore hoping that some of you may be able to figure out why it isn't working.
In order to illustrate the problem, I have used (and adjusted a bit) the example from the "Package 'survival'" document, which means I'm working with:
data("nwtco")
subcoh <- nwtco$in.subcohort
selccoh <- with(nwtco, rel==1|subcoh==1)
ccoh.data <- nwtco[selccoh,]
ccoh.data$subcohort <- subcoh[selccoh]
ccoh.data$age <- ccoh.data$age/12 # Age in years
fit.ccSP <- cch(Surv(edrel, rel) ~ stage + histol + age,
                data = ccoh.data, subcoh = ~subcohort, id = ~seqno,
                cohort.size = 4028, method = "LinYing")
The data set is looking like this:
seqno instit histol stage study rel edrel age in.subcohort subcohort
4 4 2 1 4 3 0 6200 2.333333 TRUE TRUE
7 7 1 1 4 3 1 324 3.750000 FALSE FALSE
11 11 1 2 2 3 0 5570 2.000000 TRUE TRUE
14 14 1 1 2 3 0 5942 1.583333 TRUE TRUE
17 17 1 1 2 3 1 960 7.166667 FALSE FALSE
22 22 1 1 2 3 1 93 2.666667 FALSE FALSE
Then, I'm trying to illustrate the effect of stage in an adjusted survival curve, using the ggadjustedcurves-function from the survminer package:
library(survminer)
ggadjustedcurves(fit.ccSP, variable = ccoh.data$stage, data = ccoh.data)
#Error in survexp(as.formula(paste("~", variable)), data = ndata, ratetable = fit) :
# Invalid rate table
But unfortunately, this is not working. Can anyone figure out why? And can this somehow be fixed or done in another way?
Essentially, I'm looking for a way to graphically illustrate the effect of a continuous variable in a weighted Cox regression performed on a case-cohort data set, so I would generally also be interested in hearing whether there are alternatives to adjusted survival curves.
Two reasons it is throwing errors:
The ggadjustedcurves function is not being given a coxph object, which its help page indicates is the expected first argument.
The specification of the variable argument is incorrect. The correct way to specify a column is a length-1 character vector that matches one of the names in the formula. You gave it a vector of length 1154.
This code succeeds:
fit.ccSP <- coxph(Surv(edrel, rel) ~ stage + histol + age,
                  data = ccoh.data)
ggadjustedcurves(fit.ccSP, variable = 'stage', data = ccoh.data)
It might not be exactly what you want, but it does answer the "why-error" part of your question. You might want to review the methods used by Therneau, Cynthia S. Crowson, and Elizabeth J. Atkinson in their vignette on adjusted curves:
https://cran.r-project.org/web/packages/survival/vignettes/adjcurve.pdf

Clustering with Mclust results in an empty cluster

I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
log-likelihood n df BIC ICL
-20504.71 3258 8 -41074.13 -44326.69
Clustering table:
1 2 3 4
0 2271 896 91
Mixing probabilities:
1 2 3 4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
1 2 3 4
1381.391 1381.715 1574.335 1851.667
Variances:
1 2 3 4
7466.189 7466.189 7466.189 7466.189
Edit: Here is my data for download: https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0), especially with mean values nearly identical to the second cluster. This only appears when specifically looking for a univariate, equal-variance model. Using, for example, modelNames="V", or leaving it at the default, does not produce this problem.
This thread: Cluster contains no observations has a similar problem, but if I understand correctly, that appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or whether I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there's quite a lot of data there (see the spike on the histogram):
set.seed(111)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value,br=50)
abline(v=fit$parameters$mean,
col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"),lty=8)
Briefly, mclust fits a Gaussian mixture model (GMM), a probabilistic model that estimates the mean and variance of each cluster as well as the probability of each point belonging to each cluster; this is unlike k-means, which gives a hard assignment. The model's likelihood is built from the mixture-weighted probabilities of every data point under each component; you can also check this in mclust's publication.
In this model, the means of cluster 1 and cluster 2 are close, but their expected proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that if you have a data point around the means of 1 or 2, it will consistently be assigned to cluster 2. For example, let's predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability, so in the same example everything is assigned to cluster 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
To answer your question: no, you did not do anything wrong; it is a pitfall, at least with this implementation of GMM. I would say it's a bit of overfitting, but you can basically keep only the clusters that actually have members (see the short sketch below).
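A tiny sketch of that last point, using the fit object from your code (fit$G and fit$classification are standard Mclust output fields):
tab <- table(factor(fit$classification, levels = seq_len(fit$G)))
tab              # component 1 receives 0 observations here
which(tab > 0)   # the components that actually have members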
If you use model="V", i see the solution is equally problematic:
fitv <- Mclust(Data$value, modelNames="V", G = 1:7)
plot(fitv,what="classification")
Using scikit learn GMM I don't see a similar issue.. So if you need to use a gaussian mixture with spherical means, consider using a fuzzy kmeans:
library(ClusterR)
# fit_kmeans: the fuzzy clustering fit from ClusterR (its creation is not shown in the original post)
plot(NULL, xlim = range(data$value), ylim = c(0, 4), ylab = "cluster", yaxt = "n", xlab = "values")
points(data$value, fit_kmeans$clusters, pch = 19, cex = 0.1, col = factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.

Bug with VGAM? vglm family=posnegbinomial => "Error in if (take.half.step) { : missing value where TRUE/FALSE needed"

I have some actual data that I'm afraid is somewhat nasty.
It's essentially a positive negative binomial distribution (without any zero counts). However, there are some outliers that seem to cause some bad calculations to occur (maybe underflow or NaNs?). The first 8 or so entries are reasonable, but I'm guessing the last few are causing problems with the fitting.
Here's the data:
> df
counts t
1 1968 1
2 217 2
3 55 3
4 26 4
5 11 5
6 5 6
7 8 7
8 3 8
9 1 10
10 1 11
11 1 12
12 1 13
13 1 15
14 1 18
15 1 26
16 1 59
This command runs for a while and then spits out the error message
> vglm(counts ~ t, data=df, family = posnegbinomial)
Error in if (take.half.step) { : missing value where TRUE/FALSE needed
BUT, if I rerun this cutting off the outliers, I get a solution for posnegbinomial
> vglm(counts ~ t, data=df[1:9,], family = posnegbinomial)
Call:
vglm(formula = counts ~ t, family = posnegbinomial, data = df[1:9,])
Coefficients:
(Intercept):1 (Intercept):2 t
7.7487404 0.7983811 -0.9427189
Degrees of Freedom: 18 Total; 15 Residual
Log-likelihood: -36.21064
If I try the family pospoisson (Positive Poisson: no zero values), I get a similar error "argument is not interpretable as logical".
I do notice that there are a number of similar questions on Stack Overflow about missing values where TRUE/FALSE is needed, but with other R packages. This suggests to me that perhaps package writers need to better anticipate that calculations might fail.
I think your proximal problem is that the predicted means for the negative binomial for your extreme values are so close to zero that they are underflowing to zero, in a way that was not anticipated/protected against by the package authors. (One thing to realize about nonlinear optimization/fitting is that it is always possible to break a fitting method by giving it extreme data ...)
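A rough back-of-the-envelope illustration of that suspicion, plugging in the mean-model coefficients from the comparison table at the end of this answer (log_int ~ 7.77, slope ~ -0.95, log_k ~ 0.81); the precise failure mode inside VGAM is my guess, not something I have traced:
mu <- exp(7.77 - 0.95 * 59)  # predicted NB mean at the largest t, roughly 1e-21
k  <- exp(0.81)              # NB size parameter
p0 <- (k / (k + mu))^k       # P(Y = 0) for the untruncated negative binomial
1 - p0                       # exactly 0 in double precision, so the zero-truncation correction blows up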
I couldn't get this to work in VGAM, but I'll offer a couple of other suggestions. (Below, dd is the data frame from the question, called df there.)
plot(log(counts) ~ t, data = dd)
And eyeballing the data to get an initial estimate of parameter values (at least for the mean model):
m0 <- lm(log(counts)~t,data=subset(dd,t<10))
I thought I might be able to get vglm() to work by setting starting values, but that didn't actually pan out, even when I have fairly good values from other platforms (see below).
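For concreteness, this is the sort of thing I mean by setting starting values (a sketch only: rough starts taken from the truncated lm fit above, with the coefstart ordering assumed to match the (Intercept):1, (Intercept):2, t layout printed in the question; as noted, this did not pan out):
svals <- c(coef(m0)[1], 0, coef(m0)[2])  # log-mean intercept, log-size start, slope
try(vglm(counts ~ t, data = dd, family = posnegbinomial, coefstart = svals))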
glmmADMB
The glmmADMB package can handle positive NB, via family="truncnbinom":
library(glmmADMB)
m1 <- glmmadmb(counts~t, data=dd, family="truncnbinom")
(there are some warning messages ...)
bbmle::mle2()
This requires a little bit more work: it failed with the standard model, but works if I set a floor on the predicted mean ...
library(VGAM) ## for dposnegbin
library(bbmle)
m2 <- mle2(counts ~ dposnegbin(size = exp(logk),
                               munb = pmax(exp(logeta), 1e-7)),
           parameters = list(logeta ~ t),
           data = dd,
           start = list(logk = 0, logeta = 0))
Again warning messages.
Compare glmmADMB, mle2, simple truncated lm fit ...
cc <- cbind(coef(m2),
            c(log(m1$alpha), coef(m1)),
            c(NA, coef(m0)))
dimnames(cc) <- list(c("log_k", "log_int", "slope"),
                     c("mle2", "glmmADMB", "lm"))
## mle2 glmmADMB lm
## log_k 0.8094678 0.8094625 NA
## log_int 7.7670604 7.7670637 7.1747551
## slope -0.9491796 -0.9491778 -0.8328487
This is in principle also possible with glmmTMB, but it runs into the same kinds of problems as vglm() ...
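For reference, the glmmTMB call alluded to would look roughly like this (a sketch; truncated_nbinom2 is glmmTMB's zero-truncated negative binomial family, and as noted it hits similar numerical trouble on this data set):
library(glmmTMB)
m3 <- glmmTMB(counts ~ t, data = dd, family = truncated_nbinom2(link = "log"))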
