Predict() with nested multinomial logit models in R

I'm using the mlogit package in R to create a nested multinomial logit model of healthcare provider choice given choice data I have. The data look like this:
ID RES weight age wealth educ married urban partnerAge totalChildren survivingChildren anyANC
1.0 2468158 FALSE 0.2609153 29 Poor Primary 1 0 31 4 4 1
1.1 2468158 TRUE 0.2609153 29 Poor Primary 1 0 31 4 4 1
1.2 2468158 FALSE 0.2609153 29 Poor Primary 1 0 31 4 4 1
1.3 2468158 FALSE 0.2609153 29 Poor Primary 1 0 31 4 4 1
2.0 14233860 FALSE 0.2754970 19 Poorest Primary 1 0 30 1 1 1
2.1 14233860 TRUE 0.2754970 19 Poorest Primary 1 0 30 1 1 1
2.2 14233860 FALSE 0.2754970 19 Poorest Primary 1 0 30 1 1 1
2.3 14233860 FALSE 0.2754970 19 Poorest Primary 1 0 30 1 1 1
outlier50Km optout alt spa mes dist bobs cobs Q fees chid educSec
1.0 0 -1 0 Home Home 0.000 0.0000000 0.000 0.00 0 1 0
1.1 0 -1 1 Health center Public 13.167 0.4898990 NA 0.64 0 1 0
1.2 0 -1 2 Health center Public 30.596 0.5202020 NA 0.56 0 1 0
1.3 0 -1 3 District hospital Public 41.164 0.7171717 0.825 0.88 0 1 0
2.0 0 -1 0 Home Home 0.000 0.0000000 0.000 0.00 0 2 0
2.1 0 -1 1 Health center Mission 14.756 0.7676768 NA 0.64 1 2 0
2.2 0 -1 2 Health center Public 41.817 0.3787879 NA 0.56 0 2 0
2.3 0 -1 3 District hospital Public 50.419 0.7171717 0.825 0.88 0 2 0
where spa, mes, dist, bobs, cobs, Q, and fees are characteristics of the provider, and the remaining variables are specific to the individual. These data are in long format, meaning each individual has four rows, one per alternative (alt = 0:3), with RES indicating which alternative was chosen.
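For reference, a data frame in this shape is typically indexed by chooser and alternative with mlogit.data() before fitting; a minimal sketch of that step (the exact call is an assumption, with column names taken from the listing above and raw_df standing for the unindexed data frame):
library(mlogit)
# Hedged sketch: build the chooser/alternative index that mlogit expects.
data <- mlogit.data(raw_df, choice = "RES", shape = "long",
                    alt.var = "alt", chid.var = "chid")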
An un-nested model behaves appropriately
f.full <- RES ~ 0 + dist + Q + bobs + fees + spa | 0 + age + wealth + educSec + married + urban + totalChildren + survivingChildren
choice.ml.full <- mlogit(formula = f.full, data = data, weights = weight)
predict(choice.ml.full, data[1:8,])
0 1 2 3
[1,] 0.1124429 0.7739403 0.06893341 0.04468343
[2,] 0.4465272 0.3107375 0.11490317 0.12783210
By all measures of model fit, however, a nested model is better than an un-nested one. The nested model gives me coefficients appropriately:
ns2 <- mlogit(formula = f.full, nests = list(home = "0", useCare = c("1", "2", "3")), data = data, weights = weight, un.nest.el = TRUE)
summary(ns2)
Call:
mlogit(formula = f.full, data = data, weights = weight, nests = list(home = "0",
useCare = c("1", "2", "3")), un.nest.el = TRUE)
Frequencies of alternatives:
0 1 2 3
0.094378 0.614216 0.194327 0.097079
bfgs method
23 iterations, 0h:0m:13s
g'(-H)^-1g = 9.51E-07
gradient close to zero
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
dist -0.0336233 0.0040136 -8.3773 < 2.2e-16 ***
Q 0.1780058 0.0768181 2.3172 0.0204907 *
bobs -0.0695695 0.0505795 -1.3754 0.1689925
fees -0.8488132 0.1001928 -8.4718 < 2.2e-16 ***
etc...
But, I get the following error if I try to predict on a single individual:
predict(ns2, data[1:4,])
Error in apply(Gl, 1, sum) : dim(X) must have a positive length
and a different error if I try to predict on more than one individual:
predict(ns2, data[1:8,])
Error in solve.default(crossprod(attr(x, "gradi")[, !fixed])) :
Lapack routine dgesv: system is exactly singular: U[5,5] = 0
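A possible partial workaround for in-sample probabilities (a sketch that sidesteps predict() rather than fixing it; it assumes fitted() returns per-alternative probabilities for nested fits the same way it does for un-nested ones):
# In-sample choice probabilities for every alternative, without calling predict():
p.hat <- fitted(ns2, outcome = FALSE)
head(p.hat, 2)   # rows = choice situations (chid), columns = alternatives 0-3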
Any help would be vastly appreciated.

Related

How can I edit this apriori algorithm code in market basket analysis using rhs and lhs?

I'm trying Market Basket Analysis using the 'apriori' algorithm.
The 'Catalog' data set consists of 8 columns and 200 rows:
Automotive Computers Personal.Electronics Garden Clothing Health Jewelry Housewares
0 1 0 1 1 0 0 0 1
0 0 0 1 0 1 0 0 1
0 1 1 0 0 0 1 0 0
1 0 0 0 0 1 0 0 0
0 0 0 1 1 0 0 1 0
First, I tried the apriori algorithm without any rhs/lhs limitation, and here's the result:
Catalogrules <- apriori(Catalog, parameter= list(support =
0.1, confidence = 0.3, minlen = 2))
inspect(sort(Catalogrules, by = "lift")[1:5])
lhs rhs support confidence coverage lift count
[1] {Automotive=0,
     Computers=0,
     Personal.Electronics=0,
     Clothing=0}              => {Garden=1}   0.110 0.9565217 0.115 2.943144 22
[2] {Personal.Electronics=0,
     Jewelry=1}               => {Clothing=1} 0.125 0.7142857 0.175 2.857143 25
[3] {Clothing=1,
     Health=0}                => {Jewelry=1}  0.120 0.8888889 0.135 2.821869 24
[4] {Automotive=0,
     Personal.Electronics=0,
     Clothing=0}              => {Garden=1}   0.110 0.9166667 0.120 2.820513 22
[5] {Computers=0,
     Personal.Electronics=0,
     Jewelry=1}               => {Clothing=1} 0.105 0.6774194 0.155 2.709677 21
I want to see rules involving "=1" items only, because the '=0' items don't show any meaningful relationships.
ex) {Jewelry=1, Computers=1} -> {Clothing=1}
So, I tried to make new code using rhs and lhs.
Catalogrules <- apriori(Catalog, parameter= list(support =
0.1, confidence = 0.3, minlen = 2), appearance=list(rhs = c("Automotive=1", "Computers=1", "Personal.Electronics=1", "Garden=1", "Clothing=1", "Health=1", "Jewelry=1", "Housewares=1"), lhs = c("Automotive=1", "Computers=1", "Personal.Electronics=1", "Garden=1", "Clothing=1", "Health=1", "Jewelry=1", "Housewares=1"), default="rhs"))
But this is the error message:
Error in asMethod(object) :
The following items cannot be specified in multiple appearance locations: Automotive=1, Computers=1, Personal.Electronics=1, Garden=1, Clothing=1, Health=1, Jewelry=1, Housewares=1
I want to print top 5 rules using apriori algorithm.
ex) inspect(sort(Catalogrules, by = "lift")[1:5])
1. {Automotive=1} -> {Garden=1}
2. {Personal.Electronics=1} -> {Computers=1}
3. {Jewelry=1} -> {Clothing=1}
4. {Garden=1} -> {Automotive=1}
5. {Clothing=1} -> {Jewelry=1}
I need your help.
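One direction that might help (an assumption on my part, not from the original post): either declare the "=1" items once under the appearance element both, if your arules version supports it, or mine all rules and filter afterwards with subset() and the partial-match operator %pin%. A sketch:
library(arules)

# Items of interest, built from the column names shown above (e.g. "Garden=1").
items1 <- paste0(colnames(Catalog), "=1")

# Option 1 (sketch): allow only "=1" items, on either side of a rule.
Catalogrules <- apriori(Catalog,
                        parameter  = list(support = 0.1, confidence = 0.3, minlen = 2),
                        appearance = list(both = items1, default = "none"))

# Option 2 (sketch): mine all rules, then keep those whose lhs and rhs
# contain "=1" items, via partial matching on the item labels.
allrules <- apriori(Catalog, parameter = list(support = 0.1, confidence = 0.3, minlen = 2))
rules1   <- subset(allrules, subset = lhs %pin% "=1" & rhs %pin% "=1")

inspect(sort(rules1, by = "lift")[1:5])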

How to apply the gmnl function in R for a latent class analysis of a discrete choice experiment?

I am trying to perform a latent class analysis on my data from a discrete choice experiment. The respondents had to choose between 2 options whose attributes were the number of children they prefer and the educational level they prefer for their children (stated as a mixture over the number of children). The first rows of my data look like this:
Respondent Block Choice card Chosen FNoPrimary FPrimary FSecondary FTertiary MNoPrimary
1 1 1 1 0.0000000 0.0000000 0.00 0.0000000 0.0000000
1 1 1 0 0.3333333 0.6666667 0.00 0.0000000 0.0000000
1 2 12 0 0.3333333 0.3333333 0.00 0.0000000 0.0000000
1 2 12 1 0.1666667 0.0000000 0.00 0.3333333 0.1666667
1 3 2 0 0.0000000 0.0000000 1.00 0.0000000 0.0000000
1 3 2 1 0.0000000 0.0000000 0.25 0.0000000 0.0000000
MPrimary MSecondary MTertiary NChildren Age District Religion Indigenous Ethnic group Sex
1 0 1.00 0.0000000 1 18 0 Protestant 0 Wolaita Female
2 0 0.00 0.0000000 3 18 0 Protestant 0 Wolaita Female
3 0 0.00 0.3333333 9 18 0 Protestant 0 Wolaita Female
4 0 0.00 0.3333333 12 18 0 Protestant 0 Wolaita Female
5 0 0.00 0.0000000 1 18 0 Protestant 0 Wolaita Female
6 0 0.25 0.5000000 4 18 0 Protestant 0 Wolaita Female
Educational level Studentornot Farmerornot Marital status Having children Ever used contraception
1 High school - grade 10 1 0 0 0 0
2 High school - grade 10 1 0 0 0 0
3 High school - grade 10 1 0 0 0 0
4 High school - grade 10 1 0 0 0 0
5 High school - grade 10 1 0 0 0 0
6 High school - grade 10 1 0 0 0 0
Alternative
1 1
2 2
3 1
4 2
5 1
6 2
I looked at all the packages available in R and I think that only the gmnl package can handle my type of data and allows covariates. However, when I compare the output of my latent class analysis for a simple model with only 2 covariates (age and district, as stated below), I get totally different parameter estimates and a different log-likelihood (-2598.6 in Stata vs. -2495.5 in R) than when I perform the same analysis in Stata (see code below).
in R:
defining_data <- mlogit.data(final_data_alternativeadded, id.var = "Respondent", choice = "Chosen", alt.var = "Alternative", chid.var="Choice.card", group.var = "Block", varying = 7:15, shape = "long")
mnl <- gmnl(Chosen ~ 1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + z | 0 | 0 | 0 | Age + District, data = defining_data, model = 'lc', Q = 3)
summary(mnl)
in Stata:
ssc install lclogit
ssc install fmlogit
lclogit chosen fprimary fsecondary ftertiary mnoprimary mprimary msecondary mtertiary block nchildren, group(choicecard) id(respondent) nclasses(3) membership(age district)
I tried making all my variables numeric, multiplying the mixture proportions by the number of children to get values closer to each other, and ordering my dataset by the number of choice cards per respondent... but I always get different values for the latent class probabilities. Does anyone know why? I know that lclogit in Stata uses the expectation-maximization algorithm and gmnl in R uses maximum likelihood, but I don't think the parameter estimates and the log-likelihood can differ this much just because of that.
I also tried it with an existing open dataset:
Stata:
use http://fmwww.bc.edu/repec/bocode/t/traindata.dta, clear
ssc install lclogit
ssc install fmlogit
lclogit y price contract local wknown tod seasonal, group(gid) id(pid) nclasses(3) seed(12345)
R:
getwd()
library(xlsx)   # read.xlsx() as called here comes from the 'xlsx' package
traindata <- read.xlsx("traindata.xlsx", 1, header = TRUE)
library(mlogit)
library(gmnl)
traindata[1:nrow(traindata),11] <- seq(1,4) #to add a column with the alternatives per choice numbered from 1 to 4
names(traindata) <- c('y', 'price', 'contract', 'local', 'wknown', 'tod', 'seasonal', 'gid', 'pid', 'X_xi', 'Alternative')
TM <- mlogit.data(traindata, choice = "y", id.var = "pid", alt.var = "Alternative", chid.var = "gid", shape = "long")
mnl <- gmnl(y ~ price + contract + local + wknown + tod + seasonal | 0 | 0 | 0 | 1, data = TM, model = 'lc', Q = 3)
summary(mnl)
However, Stata reports a log-likelihood of -1117.9987 while R reports -1329.5, and the two give totally different parameter estimates.
Does anyone know why this is the case?
Thank you very much in advance
Kind regards
Eva
There are a couple of things here that might be useful to think about.
You would expect some minor differences between the two because i) latent class models are susceptible to converging to different local optima depending on starting values, and ii) the expectation-maximization algorithm and gradient-based maximum likelihood can converge to different solutions. However, this is unlikely to be what is happening here.
The likely issue is that Stata takes the panel structure of the data into account, whereas in R you have specified a cross-sectional latent class model that treats each observation as independent. The large difference in log-likelihood is a strong indicator of this.
Try this instead (with panel = TRUE):
mnl <- gmnl(y ~ price + contract + local + wknown + tod + seasonal | 0 | 0 | 0 | 1,
data = TM, model = 'lc', Q = 3, panel = TRUE)

big dataframe: "repeated" t-test between groups for thousands of factors

I have read a lot of posts related to data wrangling and “repeated” t-tests, but I can’t figure out how to do it in my case.
You can get my example dataset for StackOverflow here: https://www.dropbox.com/s/0b618fs1jjnuzbg/dataset.example.stckovflw.txt?dl=0
I have a big dataframe of gene expression like:
> b<-read.delim("dataset.example.stckovflw.txt")
> head(b)
animal gen condition tissue LogFC
1 animalcontrol1 kjhss1 control brain 7.129283
2 animalcontrol1 sdth2 control brain 7.179909
3 animalcontrol1 sgdhstjh20 control brain 9.353147
4 animalcontrol1 jdygfjgdkydg21 control brain 6.459432
5 animalcontrol1 shfjdfyjydg22 control brain 9.372865
6 animalcontrol1 jdyjkdg23 control brain 9.541097
> str(b)
'data.frame': 21507 obs. of 5 variables:
$ animal : Factor w/ 25 levels "animalcontrol1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ gen : Factor w/ 1131 levels "dghwg1041","dghwg1086",..: 480 761 787 360 863 385 133 888 563 738 ...
$ condition: Factor w/ 5 levels "control","treatmentA",..: 1 1 1 1 1 1 1 1 1 1 ...
$ tissue : Factor w/ 2 levels "brain","heart": 1 1 1 1 1 1 1 1 1 1 ...
$ LogFC : num 7.13 7.18 9.35 6.46 9.37 ...
Each group has 5 animals, and each animal has many genes ("gen") quantified. (However, each animal may have a different set of quantified genes, though many of the genes will be in common between animals and groups.)
I would like to perform a t-test for each gene between each treated group (A, B, C or D) and the controls. The result should be a table containing the p-value for each gene in each group.
Because I have so many genes (thousands), I cannot subset each gene by hand.
Do you know how I could automate the procedure?
I was thinking about a loop, but I am not at all sure it could achieve what I want, nor how to proceed.
Also, I was looking at these posts using the apply function: Apply t-test on many columns in a dataframe split by factor and Looping through t.tests for data frame subsets in r
################ Additional information after reading the first comments and answers:
#andrew_reece: Thank you very much for this. It is almost exactly what I was looking for. However, I can’t find a way to do it with a t-test. The ANOVA is interesting information, but I will then need to know which of the treated groups is/are significantly different from my controls. I would also need to know which treated groups are significantly different from each other, “two by two”.
I have been trying to use your code, changing the aov(..) to t.test(…). To do that, I first take subset(b, condition == "control" | condition == "treatmentA") in order to compare only two groups. However, when exporting the result table to a csv file, the table is not understandable (no gene name, no p-values, etc., only numbers). I will keep searching for a way to do it properly, but for now I’m stuck.
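For concreteness, one way to adapt the nest()/map() approach below to a two-group t-test might look like this (a sketch only; the droplevels() call and the per-group filters are assumptions, not something from the thread):
library(tidyverse)
library(broom)

# Two-group version (control vs treatmentA); droplevels() so the condition
# factor keeps only the two levels t.test() expects.
b2 <- droplevels(subset(b, condition %in% c("control", "treatmentA")))

t_results <- b2 %>%
  group_by(gen) %>%
  filter(n_distinct(condition) == 2,        # gene measured in both groups
         min(table(condition)) >= 2) %>%    # at least 2 animals per group
  nest() %>%
  mutate(test   = map(data, ~ t.test(LogFC ~ condition, data = .x)),
         tidied = map(test, tidy)) %>%
  unnest(tidied) %>%
  select(gen, estimate, statistic, p.value)

write.csv(t_results, "ttests_treatmentA_vs_control.csv", row.names = FALSE)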
#42:
Thank you very much for these tips. This is just an example dataset; let’s assume we do have to use individual t-tests.
This is a very useful start for exploring my data. For example, I have been trying to represent my data with Venn diagrams. I can write my code, but it is somewhat outside the initial topic. Also, I don't know a less tedious way to summarize the shared genes detected in each combination of conditions, so I have simplified to only 3 conditions.
# Visualisation of shared genes by VennDiagrams :
# let's simplify and consider only 3 conditions :
b<-read.delim("dataset.example.stckovflw.txt")
b<- subset(b, condition == "control" | condition == "treatmentA" | condition == "treatmentB")
b1<-table(b$gen, b$condition)
b1
b2<-subset(data.frame(b1, "control" > 2
|"treatmentA" > 2
|"treatmentB" > 2 ))
b3<-subset(b2, Freq>2) # select only genes that have been quantified in at least 2 animals per group
b3
b4 = within(b3, {
Freq = ifelse(Freq > 1, 1, 0)
}) # for these observations we consider the gene detected, so we recode Freq to 1 regardless of the actual frequency of occurrence
b4
b5<-table(b4$Var1, b4$Var2)
write.csv(b5, file = "b5.csv")
# make an intermediate .txt file (bb5.txt: just add the first column's title manually)
# so now we have info
bb5<-read.delim("bb5.txt")
nrow(subset(bb5, control == 1))
nrow(subset(bb5, treatmentA == 1))
nrow(subset(bb5, treatmentB == 1))
nrow(subset(bb5, control == 1 & treatmentA == 1))
nrow(subset(bb5, control == 1 & treatmentB == 1))
nrow(subset(bb5, treatmentA == 1 & treatmentB == 1))
nrow(subset(bb5, control == 1 & treatmentA == 1 & treatmentB == 1))
library(grid)
library(futile.logger)
library(VennDiagram)
venn.plot <- draw.triple.venn(area1 = 1005,
area2 = 927,
area3 = 943,
n12 = 843,
n23 = 861,
n13 = 866,
n123 = 794,
category = c("controls", "treatmentA", "treatmentB"),
fill = c("red", "yellow", "blue"),
cex = 2,
cat.cex = 2,
lwd = 6,
lty = 'dashed',
fontface = "bold",
fontfamily = "sans",
cat.fontface = "bold",
cat.default.pos = "outer",
cat.pos = c(-27, 27, 135),
cat.dist = c(0.055, 0.055, 0.085),
cat.fontfamily = "sans",
rotation = 1);
Update (per OP comments):
Pairwise comparison across condition can be managed with an ANOVA post-hoc test, such as Tukey's Honest Significant Difference (stats::TukeyHSD()). (There are others, this is just one way to demonstrate PoC.)
results <- b %>%
mutate(condition = factor(condition)) %>%
group_by(gen) %>%
filter(length(unique(condition)) >= 2) %>%
nest() %>%
mutate(
model = map(data, ~ TukeyHSD(aov(LogFC ~ condition, data = .x))),
coef = map(model, ~ broom::tidy(.x))
) %>%
unnest(coef) %>%
select(-term)
results
# A tibble: 7,118 x 6
gen comparison estimate conf.low conf.high adj.p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 kjhss1 treatmentA-control 1.58 -20.3 23.5 0.997
2 kjhss1 treatmentC-control -3.71 -25.6 18.2 0.962
3 kjhss1 treatmentD-control 0.240 -21.7 22.2 1.000
4 kjhss1 treatmentC-treatmentA -5.29 -27.2 16.6 0.899
5 kjhss1 treatmentD-treatmentA -1.34 -23.3 20.6 0.998
6 kjhss1 treatmentD-treatmentC 3.95 -18.0 25.9 0.954
7 sdth2 treatmentC-control -1.02 -21.7 19.7 0.991
8 sdth2 treatmentD-control 3.25 -17.5 24.0 0.909
9 sdth2 treatmentD-treatmentC 4.27 -16.5 25.0 0.849
10 sgdhstjh20 treatmentC-control -7.48 -30.4 15.5 0.669
# ... with 7,108 more rows
Original answer
You can use tidyr::nest() and purrr::map() to accomplish the technical task of grouping by gen, and then conducting statistical tests comparing the effects of condition (presumably with LogFC as your DV).
But I agree with the other comments that there are issues with your statistical approach here that bear careful consideration - stats.stackexchange.com is a better forum for those questions.
For the purpose of demonstration, I've used an ANOVA instead of a t-test, since there are frequently more than two conditions per gen grouping. This shouldn't really change the intuition behind the implementation, however.
require(tidyverse)
results <- b %>%
mutate(condition = factor(condition)) %>%
group_by(gen) %>%
filter(length(unique(condition)) >= 2) %>%
nest() %>%
mutate(
model = map(data, ~ aov(LogFC ~ condition, data = .x)),
coef = map(model, ~ broom::tidy(.x))
) %>%
unnest(coef)
A few cosmetic trimmings to get closer to your original vision (of just a table with gen and p-values), although note that this really leaves a lot of important information out and I'm not advising you actually limit your results in this way.
results %>%
  filter(term != "Residuals") %>%
  select(gen, df, statistic, p.value)
# A tibble: 1,111 x 4
gen df statistic p.value
<chr> <dbl> <dbl> <dbl>
1 kjhss1 3. 0.175 0.912
2 sdth2 2. 0.165 0.850
3 sgdhstjh20 2. 0.440 0.654
4 jdygfjgdkydg21 2. 0.267 0.770
5 shfjdfyjydg22 2. 0.632 0.548
6 jdyjkdg23 2. 0.792 0.477
7 fckjhghw24 2. 0.790 0.478
8 shsnv25 2. 1.15 0.354
9 qeifyvj26 2. 0.588 0.573
10 qsiubx27 2. 1.14 0.359
# ... with 1,101 more rows
Note: I can't take much credit for this approach - it's taken almost verbatim from an example I saw Hadley give at a talk last night on purrr. Here's a link to the public repo of the demo code he used, which covers a similar use case.
You have 25 animals in 5 different treatment groups with a varying number of gen-values (presumably activities of genetic probes) in two different tissues:
table(b$animal, b$condition)
control treatmentA treatmentB treatmentC treatmentD
animalcontrol1 1005 0 0 0 0
animalcontrol2 857 0 0 0 0
animalcontrol3 959 0 0 0 0
animalcontrol4 928 0 0 0 0
animalcontrol5 1005 0 0 0 0
animaltreatmentA1 0 927 0 0 0
animaltreatmentA2 0 883 0 0 0
animaltreatmentA3 0 908 0 0 0
animaltreatmentA4 0 861 0 0 0
animaltreatmentA5 0 927 0 0 0
animaltreatmentB1 0 0 943 0 0
animaltreatmentB2 0 0 841 0 0
animaltreatmentB3 0 0 943 0 0
animaltreatmentB4 0 0 910 0 0
animaltreatmentB5 0 0 943 0 0
animaltreatmentC1 0 0 0 742 0
animaltreatmentC2 0 0 0 724 0
animaltreatmentC3 0 0 0 702 0
animaltreatmentC4 0 0 0 698 0
animaltreatmentC5 0 0 0 742 0
animaltreatmentD1 0 0 0 0 844
animaltreatmentD2 0 0 0 0 776
animaltreatmentD3 0 0 0 0 812
animaltreatmentD4 0 0 0 0 783
animaltreatmentD5 0 0 0 0 844
Agree you need to "automate" this in some fashion, but I think you are in need of a more general strategy for statistical inference rather than trying to pick out relationships by applying individual t-tests. You might consider either mixed models or one of the random forest variants. I think you should be discussing this with a statistician. As an example of where your hopes are not going to be met, take a look at the information you have about the first "gen" among the 1131 values:
str( b[ b$gen == "dghwg1041", ])
'data.frame': 13 obs. of 5 variables:
$ animal : Factor w/ 25 levels "animalcontrol1",..: 1 6 11 2 7 12 3 8 13 14 ...
$ gen : Factor w/ 1131 levels "dghwg1041","dghwg1086",..: 1 1 1 1 1 1 1 1 1 1 ...
$ condition: Factor w/ 5 levels "control","treatmentA",..: 1 2 3 1 2 3 1 2 3 3 ...
$ tissue : Factor w/ 2 levels "brain","heart": 1 1 1 1 1 1 1 1 1 1 ...
$ LogFC : num 4.34 2.98 4.44 3.87 2.65 ...
You do have a fair number with "complete representation":
gen_length <- ave(b$LogFC, b$gen, FUN=length)
Hmisc::describe(gen_length)
#--------------
gen_length
n missing distinct Info Mean Gmd .05 .10
21507 0 18 0.976 20.32 4.802 13 14
.25 .50 .75 .90 .95
18 20 24 25 25
Value 5 8 9 10 12 13 14 15 16 17
Frequency 100 48 288 270 84 624 924 2220 64 527
Proportion 0.005 0.002 0.013 0.013 0.004 0.029 0.043 0.103 0.003 0.025
Value 18 19 20 21 22 23 24 25
Frequency 666 2223 3840 42 220 1058 3384 4925
Proportion 0.031 0.103 0.179 0.002 0.010 0.049 0.157 0.229
You might start by looking at all the "gen"s that have complete data (building a per-gene count table first):
gen_tbl <- table(b$gen)
head( gen_tbl[ gen_tbl == 25 ], 25)
#------------------
dghwg1131 dghwg546 dghwg591 dghwg636 dghwg681
25 25 25 25 25
dghwg726 dgkuck196 dgkuck286 dgkuck421 dgkuck691
25 25 25 25 25
dgkuck736 dgkukdgse197 dgkukdgse287 dgkukdgse422 dgkukdgse692
25 25 25 25 25
dgkukdgse737 djh592 djh637 djh682 djh727
25 25 25 25 25
dkgkjd327 dkgkjd642 dkgkjd687 dkgkjd732 fckjhghw204
25 25 25 25 25
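As a very rough illustration of the mixed-model direction mentioned above (a sketch only; whether pooling genes with random effects is sensible for these data is a question for your statistician, and lme4 is my choice here, not something from the thread):
library(lme4)

# One global model instead of thousands of separate t-tests:
# fixed effects for condition and tissue, random intercepts for animal and gene.
fit <- lmer(LogFC ~ condition + tissue + (1 | animal) + (1 | gen), data = b)
summary(fit)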

confusion matrix of bstTree predictions, Error: 'The data must contain some levels that overlap the reference.'

I am trying to train a model using the bstTree method and print out the confusion matrix; adverse_effects is my class attribute.
set.seed(1234)
splitIndex <- createDataPartition(attended_num_new_bstTree$adverse_effects, p = .80, list = FALSE, times = 1)
trainSplit <- attended_num_new_bstTree[ splitIndex,]
testSplit <- attended_num_new_bstTree[-splitIndex,]
ctrl <- trainControl(method = "cv", number = 5)
model_bstTree <- train(adverse_effects ~ ., data = trainSplit, method = "bstTree", trControl = ctrl)
predictors <- names(trainSplit)[names(trainSplit) != 'adverse_effects']
pred_bstTree <- predict(model_bstTree$finalModel, testSplit[,predictors])
plot.roc(auc_bstTree)
conf_bstTree= confusionMatrix(pred_bstTree,testSplit$adverse_effects)
But I get the error 'Error in confusionMatrix.default(pred_bstTree, testSplit$adverse_effects) :
The data must contain some levels that overlap the reference.'
max(pred_bstTree)
[1] 1.03385
min(pred_bstTree)
[1] 1.011738
> unique(trainSplit$adverse_effects)
[1] 0 1
Levels: 0 1
How can I fix this issue?
> head(trainSplit)
type New_missed Therapytypename New_Diesease gender adverse_effects change_in_exposure other_reasons other_medication
5 2 1 14 13 2 0 0 0 0
7 2 0 14 13 2 0 0 0 0
8 2 0 14 13 2 0 0 0 0
9 2 0 14 13 2 1 0 0 0
11 2 1 14 13 2 0 0 0 0
12 2 0 14 13 2 0 0 0 0
uvb_puva_type missed_prev_dose skintypeA skintypeB Age DoseB DoseA
5 5 1 1 1 22 3.000 0
7 5 0 1 1 22 4.320 0
8 5 0 1 1 22 4.752 0
9 5 0 1 1 22 5.000 0
11 5 1 1 1 22 5.000 0
12 5 0 1 1 22 5.000 0
I had a similar problem, which produced this error. I used the function confusionMatrix:
confusionMatrix(actual, predicted, cutoff = 0.5)
And I got the following error: Error in confusionMatrix.default(actual, predicted, cutoff = 0.5) : The data must contain some levels that overlap the reference.
I checked a couple of things:
class(actual) -> numeric
class(predicted) -> integer
unique(actual) -> plenty values, since it is probability
unique(predicted) -> 2 levels: 0 and 1
I concluded that the problem was with the cutoff part of the function, so I applied the cutoff beforehand with:
predicted<-ifelse(predicted> 0.5,1,0)
and ran the confusionMatrix function, which now works just fine:
cm<- confusionMatrix(actual, predicted)
cm$table
which generated the correct outcome.
One takeaway for your case, which might improve interpretation once you get the code working:
you mixed up the input values for your confusion matrix (as per the confusionMatrix documentation); instead of:
conf_bstTree= confusionMatrix(pred_bstTree,testSplit$adverse_effects)
you should have written:
conf_bstTree= confusionMatrix(testSplit$adverse_effects,pred_bstTree)
As said, it will most likely help you interpret the confusion matrix once you figure out how to make it work.
Hope it helps.
max(pred_bstTree) [1] 1.03385
min(pred_bstTree) [1] 1.011738
and the errors tell it all. Plotting a ROC curve is simply checking the effect of different threshold points. Rounding happens based on the threshold, e.g. with a threshold of 0.5, a prediction of 0.7 will be converted to 1 (TRUE class) and 0.3 will go to 0 (FALSE class). Threshold values are in the range (0, 1).
In your case, regardless of the threshold, you will always get all observations in the TRUE class, as even the minimum prediction is greater than 1. (That's why #phiver was wondering if you are doing regression instead of classification.) Without any zero among the predictions, there is no level in 'prediction' which coincides with the zero level of adverse_effects, hence this error.
PS: It will be difficult to tell the root cause of the error without you posting your data.
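If I had to guess at a fix without seeing the data (so treat this purely as an assumption): make adverse_effects a factor before training, and call predict() on the caret train object rather than on $finalModel, so that caret returns class labels that confusionMatrix() can compare:
library(caret)

# Sketch: train as a classifier (factor outcome) and let caret return class labels.
trainSplit$adverse_effects <- factor(trainSplit$adverse_effects)
testSplit$adverse_effects  <- factor(testSplit$adverse_effects,
                                     levels = levels(trainSplit$adverse_effects))

ctrl <- trainControl(method = "cv", number = 5)
model_bstTree <- train(adverse_effects ~ ., data = trainSplit,
                       method = "bstTree", trControl = ctrl)

pred_bstTree <- predict(model_bstTree, newdata = testSplit)   # factor predictions
confusionMatrix(pred_bstTree, testSplit$adverse_effects)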

coxph() X matrix deemed to be singular;

I'm having some trouble using coxph(). I have two categorical variables, "tecnologia" and "pais", and I want to evaluate the possible interaction effect of "pais" on "tecnologia". "tecnologia" is a factor with 2 levels: gps and convencional. "pais" also has 2 levels: PT and ES. I have no idea why this warning keeps appearing.
Here's the code and the output:
cox_AC<-coxph(Surv(dados_temp$dias_seg,dados_temp$status)~tecnologia*pais,data=dados_temp)
Warning message:
In coxph(Surv(dados_temp$dias_seg, dados_temp$status) ~ tecnologia * :
X matrix deemed to be singular; variable 3
> cox_AC
Call:
coxph(formula = Surv(dados_temp$dias_seg, dados_temp$status) ~
tecnologia * pais, data = dados_temp)
coef exp(coef) se(coef) z p
tecnologiagps -0.152 0.859 0.400 -0.38 7e-01
paisPT 1.469 4.345 0.406 3.62 3e-04
tecnologiagps:paisPT NA NA 0.000 NA NA
Likelihood ratio test=23.8 on 2 df, p=6.82e-06 n= 127, number of events= 64
I'm opening another question about this subject, although I made a similar one some months ago, because I'm facing the same problem again, with other data. And this time I'm sure it's not a data related problem.
Can somebody help me?
Thank you
UPDATE:
The problem does not seem to be perfect classification:
> xtabs(~status+tecnologia,data=dados)
tecnologia
status conv doppler gps
0 39 6 24
1 30 3 34
> xtabs(~status+pais,data=dados)
pais
status ES PT
0 71 8
1 49 28
> xtabs(~tecnologia+pais,data=dados)
pais
tecnologia ES PT
conv 69 0
doppler 1 8
gps 30 28
Here's a simple example which seems to reproduce your problem:
> library(survival)
> (df1 <- data.frame(t1=seq(1:6),
s1=rep(c(0, 1), 3),
te1=c(rep(0, 3), rep(1, 3)),
pa1=c(0,0,1,0,0,0)
))
t1 s1 te1 pa1
1 1 0 0 0
2 2 1 0 0
3 3 0 0 1
4 4 1 1 0
5 5 0 1 0
6 6 1 1 0
> (coxph(Surv(t1, s1) ~ te1*pa1, data=df1))
Call:
coxph(formula = Surv(t1, s1) ~ te1 * pa1, data = df1)
coef exp(coef) se(coef) z p
te1 -23 9.84e-11 58208 -0.000396 1
pa1 -23 9.84e-11 100819 -0.000229 1
te1:pa1 NA NA 0 NA NA
Now let's look for 'perfect classification', like so:
> (xtabs( ~ s1+te1, data=df1))
te1
s1 0 1
0 2 1
1 1 2
> (xtabs( ~ s1+pa1, data=df1))
pa1
s1 0 1
0 2 1
1 3 0
Note that a value of 1 for pa1 exactly predicts a status s1 equal to 0. That is to say, based on your data, if you know that pa1==1 then you can be sure that s1==0. Thus fitting Cox's model is not appropriate in this setting and will result in numerical errors.
This can be seen with
> coxph(Surv(t1, s1) ~ pa1, data=df1)
giving
Warning message:
In fitter(X, Y, strats, offset, init, control, weights = weights, :
Loglik converged before variable 1 ; beta may be infinite.
It's important to look at these cross tables before fitting models. Also it's worth starting with simpler models before considering those involving interactions.
If we add the interaction term to df1 manually like this:
> (df1 <- within(df1,
+ te1pa1 <- te1*pa1))
t1 s1 te1 pa1 te1pa1
1 1 0 0 0 0
2 2 1 0 0 0
3 3 0 0 1 0
4 4 1 1 0 0
5 5 0 1 0 0
6 6 1 1 0 0
Then check it with
> (xtabs( ~ s1+te1pa1, data=df1))
te1pa1
s1 0
0 3
1 3
We can see that it's a useless classifier, i.e. it does not help predict status s1.
When combining all 3 terms, the fitter does manage to produce numerical values for te1 and pa1 even though pa1 is a perfect predictor, as above. However, a look at the values of the coefficients and their errors shows them to be implausible.
Edit #JMarcelino: If you look at the warnings from the first coxph model in the example, you'll see the message:
2: In coxph(Surv(t1, s1) ~ te1 * pa1, data = df1) :
X matrix deemed to be singular; variable 3
This is likely the same warning you're getting, and it is due to this classification problem. Also, your third cross table, xtabs(~ tecnologia+pais, data=dados), is not as important as the table of status by the interaction term. You could add the interaction term manually first, as in the example above, and then check the cross table. Or you could say:
> with(df1,
table(s1, pa1te1=pa1*te1))
pa1te1
s1 0
0 3
1 3
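Applied to the data in the question, the analogous check would look something like this (a sketch; it assumes tecnologia and pais are factors with the levels shown in your xtabs output):
# Cross-table of the outcome against the tecnologia:pais interaction,
# to check for empty or perfectly-classified cells before fitting.
with(dados, table(status, interaction(tecnologia, pais, drop = TRUE)))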
That said, I notice that one of the cells in your third table has a zero (conv, PT), meaning you have no observations with this combination of predictors. This is going to cause problems when fitting.
In general, the outcome should have some values for all levels of the predictors, and the predictors should not classify the outcome as exactly all-or-nothing or 50/50.
Edit 2 #user75782131: Yes, generally speaking, xtabs or a similar cross-table should be produced for models where the outcome and predictors are discrete, i.e. have a limited number of levels. If 'perfect classification' is present then a predictive model / regression may not be appropriate. This is true, for example, for logistic regression (binary outcome) as well as Cox's model.
