confusion matrix of bstTree predictions, Error: 'The data must contain some levels that overlap the reference.' - r

I am trying to train a model using bstTree method and print out the confusion matrix. adverse_effects is my class attribute.
set.seed(1234)
splitIndex <- createDataPartition(attended_num_new_bstTree$adverse_effects, p = .80, list = FALSE, times = 1)
trainSplit <- attended_num_new_bstTree[ splitIndex,]
testSplit <- attended_num_new_bstTree[-splitIndex,]
ctrl <- trainControl(method = "cv", number = 5)
model_bstTree <- train(adverse_effects ~ ., data = trainSplit, method = "bstTree", trControl = ctrl)
predictors <- names(trainSplit)[names(trainSplit) != 'adverse_effects']
pred_bstTree <- predict(model_bstTree$finalModel, testSplit[,predictors])
plot.roc(auc_bstTree)
conf_bstTree= confusionMatrix(pred_bstTree,testSplit$adverse_effects)
But I get the error 'Error in confusionMatrix.default(pred_bstTree, testSplit$adverse_effects) :
The data must contain some levels that overlap the reference.'
max(pred_bstTree)
[1] 1.03385
min(pred_bstTree)
[1] 1.011738
> unique(trainSplit$adverse_effects)
[1] 0 1
Levels: 0 1
How can I fix this issue?
> head(trainSplit)
type New_missed Therapytypename New_Diesease gender adverse_effects change_in_exposure other_reasons other_medication
5 2 1 14 13 2 0 0 0 0
7 2 0 14 13 2 0 0 0 0
8 2 0 14 13 2 0 0 0 0
9 2 0 14 13 2 1 0 0 0
11 2 1 14 13 2 0 0 0 0
12 2 0 14 13 2 0 0 0 0
uvb_puva_type missed_prev_dose skintypeA skintypeB Age DoseB DoseA
5 5 1 1 1 22 3.000 0
7 5 0 1 1 22 4.320 0
8 5 0 1 1 22 4.752 0
9 5 0 1 1 22 5.000 0
11 5 1 1 1 22 5.000 0
12 5 0 1 1 22 5.000 0

I had similar problem, which refers to this error. I used function confusionMatrix:
confusionMatrix(actual, predicted, cutoff = 0.5)
An I got the following error: Error in confusionMatrix.default(actual, predicted, cutoff = 0.5) : The data must contain some levels that overlap the reference.
I checked couple of things like:
class(actual) -> numeric
class(predicted) -> integer
unique(actual) -> plenty values, since it is probability
unique(predicted) -> 2 levels: 0 and 1
I concluded, that there is problem with applying cutoff part of the function, so I did it before by:
predicted<-ifelse(predicted> 0.5,1,0)
and run the confusionMatrix function, which works now just fine:
cm<- confusionMatrix(actual, predicted)
cm$table
which generated correct outcome.
One takeaway for your case, which might improve interpretation once you make code working:
you mixed input values for your confusion matrix(as per confusionMatrix package documetation), instead of:
conf_bstTree= confusionMatrix(pred_bstTree,testSplit$adverse_effects)
you should have written:
conf_bstTree= confusionMatrix(testSplit$adverse_effects,pred_bstTree)
As said it will most likely help you interpret confusion matrix, once you figure out way to make it work.
Hope it helps.

max(pred_bstTree) [1] 1.03385
min(pred_bstTree) [1] 1.011738
and errors tells it all. Plotting ROC is simply checking the effect of different threshold points. Based on threshold rounding happens e.g. 0.7 will be converted to 1 (TRUE class) and 0.3 will be go 0 (FALSE class); in case threshold is 0.5. Threshold values are in range of (0,1)
In your case regardless of threshold you will always get all observations into TRUE class as even minimum prediction is greater than 1. (Thats why #phiver was wondering if you are doing regression instead of classification) . Without any zero in prediction there is no level in 'prediction' which coincide with zero level in adverse_effects and hence this error.
PS: It will be difficult to tell root cause of error without you posting your data

Related

Test to compare proportions / paired (small) samples / 7-levels categorical variables

I'm working on data from a pre-post survey: the same participants have been asked the same questions at 2 different times (so the sample are not independant). I have 19 categorical variables (Likert scale with 7 levels).
For each question, I want to know if there is a significant difference between the "pre" and "post" answer. To do this, I want to compare proportions in each of the 7 categories between pre and post results.
I have two data bases (one 'pre' and one 'post') which I have merged as in the following example (I've made sure that the categorical variables have the same levels for PRE and POST):
prepost <- data.frame(ID = c(1:7),
Quest1_PRE = c('5_SomeA','1_StronglyD','3_SomeD','4_Neither','6_Agree','2_Disagree','7_StronglyA'),
Quest1_POST = c('1_StronglyD','7_StronglyA','6_Agree','7_StronglyA','3_SomeD','5_SomeA','7_StronglyA'))
I tried to perform a McNemar test:
temp <- table(prepost_S1$Quest1_PRE,prepost_S1$Quest1_POST)
mcnemar.test(temp)
> McNemar's Chi-squared test
data: temp
McNemar's chi-squared = NaN, df = 21, p-value = NA
But whatever the question, the test always return NA values. I think it is because the pivot table (temp) has very low frequencies (I only have 24 participants).
One exemple of a pivot table (I have 22 participants):
> temp
1_StronglyD 2_Disagree 3_SomeD 4_Neither 5_SomeA 6_Agree 7_StronglyA
1_StronglyD 0 0 0 0 0 1 0
2_Disagree 0 0 0 0 1 0 0
3_SomeD 0 0 0 0 0 1 1
4_Neither 0 0 1 1 2 2 2
5_SomeA 0 0 0 0 1 1 2
6_Agree 0 0 0 0 0 3 2
7_StronglyA 0 0 0 0 0 1 2
I've tried aggregating the variables' levels into 5 instead of 7 ("1_Disagree", "2_SomeD", "3_Neither", "4_SomeA", "5_Agree") but it still doesn't work.
Is there an equivalent of Fisher's exact test for paired sample? I've done research but I couldn't find anything helpful.
If not, could you think of any other test that could answer my question (= Do the answers differ significantly between the pre and post survey)?
Thanks!

plot does not show up for an svm object and no error is returned as well

I am trying to use svm() to classify my data. A sample of my data is as follows:
ID call_YearWeek week WeekCount oc
x 2011W01 1 0 0
x 2011W02 2 1 1
x 2011W03 3 0 0
x 2011W04 4 0 0
x 2011W05 5 1 1
x 2011W06 6 0 0
x 2011W07 7 0 0
x 2011W08 8 1 1
x 2011W09 9 0 0
x 2011W10 10 0 0
x 2011W11 11 0 0
x 2011W12 12 1 1
x 2011W13 13 1 1
x 2011W14 14 1 1
x 2011W15 15 0 0
x 2011W16 16 2 1
x 2011W17 17 0 0
x 2011W18 18 0 0
x 2011W19 19 1 1
The third column shows week of the year. The 4th column shows number of calls in that week and the last column is a binary factor (if a call was received in that week or not). I used the following lines of code:
train <- data[1:105,]
test <- data[106:157,]
model <- svm(oc~week,data=train)
plot(model,train,week)
plot(model,train)
none of the last two lines work. they dont show any plots and they return no error. I wonder why this is happening.
Thanks
Seems like there are two problems here, first is that not all svm types are supported by plot.svm -- only the classification methods are, and not the regression methods. Because your response is numeric, svm() assumes you want to do regression so it chooses "eps-regression" by default. If you want to do classification, change your response to a factor
model <- svm(factor(oc)~week,data=train)
which will then use "C-classification" by default.
The second problem is that there does not seem to be a univariate predictor plot implemented. It seems to want two variables (one for x and one for y).
It may be better to take a step back and describe exactly what you want your plot to look like.

coxph() X matrix deemed to be singular;

I'm having some trouble using coxph(). I've two categorical variables:"tecnologia" and "pais", and I want to evaluate the possible interaction effect of "pais" on "tecnologia"."tecnologia" is a variable factor with 2 levels: gps and convencional. And "pais" as 2 levels: PT and ES. I have no idea why this warning keeps appearing.
Here's the code and the output:
cox_AC<-coxph(Surv(dados_temp$dias_seg,dados_temp$status)~tecnologia*pais,data=dados_temp)
Warning message:
In coxph(Surv(dados_temp$dias_seg, dados_temp$status) ~ tecnologia * :
X matrix deemed to be singular; variable 3
> cox_AC
Call:
coxph(formula = Surv(dados_temp$dias_seg, dados_temp$status) ~
tecnologia * pais, data = dados_temp)
coef exp(coef) se(coef) z p
tecnologiagps -0.152 0.859 0.400 -0.38 7e-01
paisPT 1.469 4.345 0.406 3.62 3e-04
tecnologiagps:paisPT NA NA 0.000 NA NA
Likelihood ratio test=23.8 on 2 df, p=6.82e-06 n= 127, number of events= 64
I'm opening another question about this subject, although I made a similar one some months ago, because I'm facing the same problem again, with other data. And this time I'm sure it's not a data related problem.
Can somebody help me?
Thank you
UPDATE:
The problem does not seem to be a perfect classification
> xtabs(~status+tecnologia,data=dados)
tecnologia
status conv doppler gps
0 39 6 24
1 30 3 34
> xtabs(~status+pais,data=dados)
pais
status ES PT
0 71 8
1 49 28
> xtabs(~tecnologia+pais,data=dados)
pais
tecnologia ES PT
conv 69 0
doppler 1 8
gps 30 28
Here's a simple example which seems to reproduce your problem:
> library(survival)
> (df1 <- data.frame(t1=seq(1:6),
s1=rep(c(0, 1), 3),
te1=c(rep(0, 3), rep(1, 3)),
pa1=c(0,0,1,0,0,0)
))
t1 s1 te1 pa1
1 1 0 0 0
2 2 1 0 0
3 3 0 0 1
4 4 1 1 0
5 5 0 1 0
6 6 1 1 0
> (coxph(Surv(t1, s1) ~ te1*pa1, data=df1))
Call:
coxph(formula = Surv(t1, s1) ~ te1 * pa1, data = df1)
coef exp(coef) se(coef) z p
te1 -23 9.84e-11 58208 -0.000396 1
pa1 -23 9.84e-11 100819 -0.000229 1
te1:pa1 NA NA 0 NA NA
Now lets look for 'perfect classification' like so:
> (xtabs( ~ s1+te1, data=df1))
te1
s1 0 1
0 2 1
1 1 2
> (xtabs( ~ s1+pa1, data=df1))
pa1
s1 0 1
0 2 1
1 3 0
Note that a value of 1 for pa1 exactly predicts having a status s1 equal to 0. That is to say, based on your data, if you know that pa1==1 then you can be sure than s1==0. Thus fitting Cox's model is not appropriate in this setting and will result in numerical errors.
This can be seen with
> coxph(Surv(t1, s1) ~ pa1, data=df1)
giving
Warning message:
In fitter(X, Y, strats, offset, init, control, weights = weights, :
Loglik converged before variable 1 ; beta may be infinite.
It's important to look at these cross tables before fitting models. Also it's worth starting with simpler models before considering those involving interactions.
If we add the interaction term to df1 manually like this:
> (df1 <- within(df1,
+ te1pa1 <- te1*pa1))
t1 s1 te1 pa1 te1pa1
1 1 0 0 0 0
2 2 1 0 0 0
3 3 0 0 1 0
4 4 1 1 0 0
5 5 0 1 0 0
6 6 1 1 0 0
Then check it with
> (xtabs( ~ s1+te1pa1, data=df1))
te1pa1
s1 0
0 3
1 3
We can see that it's a useless classifier, i.e. it does not help predict status s1.
When combining all 3 terms, the fitter does manage to produce a numerical value for te1 and pe1 even though pe1 is a perfect predictor as above. However a look at the values for the coefficients and their errors shows them to be implausible.
Edit #JMarcelino: If you look at the warning message from the first coxph model in the example, you'll see the warning message:
2: In coxph(Surv(t1, s1) ~ te1 * pa1, data = df1) :
X matrix deemed to be singular; variable 3
Which is likely the same error you're getting and is due to this problem of classification. Also, your third cross table xtabs(~ tecnologia+pais, data=dados) is not as important as the table of status by interaction term. You could add the interaction term manually first as in the example above then check the cross table. Or you could say:
> with(df1,
table(s1, pa1te1=pa1*te1))
pa1te1
s1 0
0 3
1 3
That said, I notice one of the cells in your third table has a zero (conv, PT) meaning you have no observations with this combination of predictors. This is going to cause problems when trying to fit.
In general, the outcome should be have some values for all levels of the predictors and the predictors should not classify the outcome as exactly all or nothing or 50/50.
Edit 2 #user75782131 Yes, generally speaking xtabs or a similar cross-table should be performed in models where the outcome and predictors are discrete i.e. have a limited no. of levels. If 'perfect classification' is present then a predictive model / regression may not be appropriate. This is true for example for logistic regression (outcome is binary) as well as Cox's model.

R tree-based methods like randomForest, adaboost: interpret result of same data with different format

Suppose my dataset is a 100 x 3 matrix filled with categorical variables. I would like to do binary classification on the response variable. Let's make up a dataset with following code:
set.seed(2013)
y <- as.factor(round(runif(n=100,min=0,max=1),0))
var1 <- rep(c("red","blue","yellow","green"),each=25)
var2 <- rep(c("shortest","short","tall","tallest"),25)
df <- data.frame(y,var1,var2)
The data looks like this:
> head(df)
y var1 var2
1 0 red shortest
2 1 red short
3 1 red tall
4 1 red tallest
5 0 red shortest
6 1 red short
I tried to do random forest and adaboost on this data with two different approaches. The first approach is to use the data as it is:
> library(randomForest)
> randomForest(y~var1+var2,data=df,ntrees=500)
Call:
randomForest(formula = y ~ var1 + var2, data = df, ntrees = 500)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 1
OOB estimate of error rate: 44%
Confusion matrix:
0 1 class.error
0 29 22 0.4313725
1 22 27 0.4489796
----------------------------------------------------
> library(ada)
> ada(y~var1+var2,data=df)
Call:
ada(y ~ var1 + var2, data = df)
Loss: exponential Method: discrete Iteration: 50
Final Confusion Matrix for Data:
Final Prediction
True value 0 1
0 34 17
1 16 33
Train Error: 0.33
Out-Of-Bag Error: 0.33 iteration= 11
Additional Estimates of number of iterations:
train.err1 train.kap1
10 16
The second approach is to transform the dataset into wide format and treat each category as a variable. The reason I am doing this is because my actual dataset has 500+ factors in var1 and var2, and as a result, tree partitioning will always divide the 500 categories into 2 splits. A lot of information is lost by doing that. To transform the data:
id <- 1:100
library(reshape2)
tmp1 <- dcast(melt(cbind(id,df),id.vars=c("id","y")),id+y~var1,fun.aggregate=length)
tmp2 <- dcast(melt(cbind(id,df),id.vars=c("id","y")),id+y~var2,fun.aggregate=length)
df2 <- merge(tmp1,tmp2,by=c("id","y"))
The new data looks like this:
> head(df2)
id y blue green red yellow short shortest tall tallest
1 1 0 0 0 2 0 0 2 0 0
2 10 1 0 0 2 0 2 0 0 0
3 100 0 0 2 0 0 0 0 0 2
4 11 0 0 0 2 0 0 0 2 0
5 12 0 0 0 2 0 0 0 0 2
6 13 1 0 0 2 0 0 2 0 0
I apply random forest and adaboost to this new dataset:
> library(randomForest)
> randomForest(y~blue+green+red+yellow+short+shortest+tall+tallest,data=df2,ntrees=500)
Call:
randomForest(formula = y ~ blue + green + red + yellow + short + shortest + tall + tallest, data = df2, ntrees = 500)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 39%
Confusion matrix:
0 1 class.error
0 32 19 0.3725490
1 20 29 0.4081633
----------------------------------------------------
> library(ada)
> ada(y~blue+green+red+yellow+short+shortest+tall+tallest,data=df2)
Call:
ada(y ~ blue + green + red + yellow + short + shortest + tall +
tallest, data = df2)
Loss: exponential Method: discrete Iteration: 50
Final Confusion Matrix for Data:
Final Prediction
True value 0 1
0 36 15
1 20 29
Train Error: 0.35
Out-Of-Bag Error: 0.33 iteration= 26
Additional Estimates of number of iterations:
train.err1 train.kap1
5 10
The results from two approaches are different. The difference is more obvious as we introduce more levels in each variable, i.e., var1 and var2. My question is, since we are using exactly the same data, why is the result different? How should we interpret the results from both approaches? Which is more reliable?
While these two models look identical, they are fundamentally different from one another- On the second model, you implicitly include the possibility that a given observation may have multiple colors and multiple heights. The correct choice between the two model formulations will depend on the characteristics of your real-world observations. If these characters are exclusive (i.e., each observation is of a single color and height), the first formulation of the model will be the right one to use. However, if an observation may be both blue and green, or any other color combination, you may use the second formulation. From a hunch looking at your original data, it seems like the first one is most appropriate (i.e., how would an observation have multiple heights??).
Also, why did you code your logical variable columns in df2 as 0s and 2s instead of 0/1? I wander if that will have any impact on the fit depending on how the data is being coded as factor or numerical.

using graph.adjacency() in R

I have a sample code in R as follows:
library(igraph)
rm(list=ls())
dat=read.csv(file.choose(),header=TRUE,row.names=1,check.names=T) # read .csv file
m=as.matrix(dat)
net=graph.adjacency(adjmatrix=m,mode="undirected",weighted=TRUE,diag=FALSE)
where I used csv file as input which contain following data:
23732 23778 23824 23871 58009 58098 58256
23732 0 8 0 1 0 10 0
23778 8 0 1 15 0 1 0
23824 0 1 0 0 0 0 0
23871 1 15 0 0 1 5 0
58009 0 0 0 1 0 7 0
58098 10 1 0 5 7 0 1
58256 0 0 0 0 0 1 0
After this I used following command to check weight values:
E(net)$weight
Expected output is somewhat like this:
> E(net)$weight
[1] 8 1 10 1 15 1 1 5 7 1
But I'm getting weird values (and every time different):
> E(net)$weight
[1] 2.121996e-314 2.121996e-313 1.697597e-313 1.291034e-57 1.273197e-312 5.092790e-313 2.121996e-314 2.121996e-314 6.320627e-316 2.121996e-314 1.273197e-312 2.121996e-313
[13] 8.026755e-316 9.734900e-72 1.273197e-312 8.027076e-316 6.320491e-316 8.190221e-316 5.092790e-313 1.968065e-62 6.358638e-316
I'm unable to find where and what I am doing wrong?
Please help me to get the correct expected result and also please tell me why is this weird output and that too every time different when I run it.??
Thanks,
Nitin
Just a small working example below, much clearer than CSV input.
library('igraph');
adjm1<-matrix(sample(0:1,100,replace=TRUE,prob=c(0.9,01)),nc=10);
g1<-graph.adjacency(adjm1);
plot(g1)
P.s. ?graph.adjacency has a lot of good examples (remember to run library('igraph')).
Related threads
Creating co-occurrence matrix
Co-occurrence matrix using SAC?
The problem seems to be due to the data-type of the matrix elements. graph.adjacency expects elements of type numeric. Not sure if its a bug.
After you do,
m <- as.matrix(dat)
set its mode to numeric by:
mode(m) <- "numeric"
And then do:
net <- graph.adjacency(m, mode = "undirected", weighted = TRUE, diag = FALSE)
> E(net)$weight
[1] 8 1 10 1 15 1 1 5 7 1

Resources