Confusion matrix with a four-level class in R

I am trying to get a confusion matrix for a multi-level factor variable (Rating).
My data looks like this:
> head(credit)
   Income    Rating Cards Age Education Gender Student Married Ethnicity Balance
1  14.891       bad     2  34        11   Male      No     Yes Caucasian     333
2 106.025 excellent     3  82        15 Female     Yes     Yes     Asian     903
3 104.593 excellent     4  71        11   Male      No      No     Asian     580
4 148.924 excellent     3  36        11 Female      No      No     Asian     964
5  55.882      good     2  68        16   Male      No     Yes Caucasian     331
6  80.180 excellent     4  77        10   Male      No      No Caucasian    1151
I built a classification tree with the rpart() function and then predicted class probabilities.
credit_model <- rpart(Rating ~ ., data=credit_train, method="class")
credit_pred <- predict(credit_model, credit_test)
Then I want to assess the prediction with CrossTable() from the gmodels package.
library(gmodels)
CrossTable(credit_test, credit_pred, prop.chisq=FALSE, prop.c=FALSE, prop.r=FALSE, dnn=c("actual Rating", "predicted Rating"))
But I get this error:
Error in CrossTable(credit_test, credit_pred, prop.chisq = FALSE,
prop.c = FALSE, : x and y must have the same length
I don't know why I get this error for a 4-level class. When I have a binary class it works fine.
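A likely cause, judging from the code shown (this is an assumption, not part of the original post): for a multi-level outcome, predict() on an rpart model returns a matrix of class probabilities rather than a vector of predicted labels, and CrossTable() also needs the actual ratings vector rather than the whole test data frame. A minimal sketch of a fix:
# Ask rpart for predicted class labels instead of a probability matrix,
# then cross-tabulate them against the actual ratings.
credit_pred <- predict(credit_model, credit_test, type = "class")
library(gmodels)
CrossTable(credit_test$Rating, credit_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c("actual Rating", "predicted Rating"))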

Related

Creating and plotting confidence intervals

I have fitted a Gaussian GLM to my data, and I now wish to create 95% CIs and plot them with the data. I'm having a couple of issues when plotting: I can't get the intervals to capture my data; they just seem to plot the same line as the model without enclosing the data points. I'm also unsure that I've created the CIs for the mean the correct way here. I have entered my data and code below in case anyone knows how to fix this.
Data used:
aids
cases quarter date
1 2 1 83.00
2 6 2 83.25
3 10 3 83.50
4 8 4 83.75
5 12 1 84.00
6 9 2 84.25
7 28 3 84.50
8 28 4 84.75
9 36 1 85.00
10 32 2 85.25
11 46 3 85.50
12 47 4 85.75
13 50 1 86.00
14 61 2 86.25
15 99 3 86.50
16 95 4 86.75
17 150 1 87.00
18 143 2 87.25
19 197 3 87.50
20 159 4 87.75
21 204 1 88.00
22 168 2 88.25
23 196 3 88.50
24 194 4 88.75
25 210 1 89.00
26 180 2 89.25
27 277 3 89.50
28 181 4 89.75
29 327 1 90.00
30 276 2 90.25
31 365 3 90.50
32 300 4 90.75
33 356 1 91.00
34 304 2 91.25
35 307 3 91.50
36 386 4 91.75
37 331 1 92.00
38 368 2 92.25
39 416 3 92.50
40 374 4 92.75
41 412 1 93.00
42 358 2 93.25
43 416 3 93.50
44 414 4 93.75
45 496 1 94.00
My code used to create the model and intervals before plotting:
#creating the model
model3 = glm(cases ~ date,
data = aids,
family = poisson(link='log'))
#now to add approx. 95% confidence envelope around this line
#predict again but at the linear predictor level along with standard errors
my_preds <- predict(model3, newdata=data.frame(aids), se.fit=T, type="link")
#calculate CI limit since linear predictor is approx. Gaussian
upper <- my_preds$fit+1.96*my_preds$se.fit #this might be logit not log
lower <- my_preds$fit-1.96*my_preds$se.fit
#transform the CI limit to get one at the level of the mean
upper <- exp(upper)/(1+exp(upper))
lower <- exp(lower)/(1+exp(lower))
#plotting data
plot(aids$date, aids$cases,
xlab = 'Date', ylab = 'Cases', pch = 20)
#adding CI lines
plot(aids$date, exp(my_preds$fit), type = "link",
xlab = 'Date', ylab = 'Cases') #add title
lines(aids$date,exp(my_preds$fit+1.96*my_preds$se.fit),lwd=2,lty=2)
lines(aids$date,exp(my_preds$fit-1.96*my_preds$se.fit),lwd=2,lty=2)
This is the outcome I currently get, with no data points shown. The model line looks correct here, but the CI doesn't, and since there are no data points on the plot I think the CIs are being made incorrectly somewhere.
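One possible fix for the snippet above, sketched under the assumption that a Poisson GLM with a log link is what is intended: the confidence limits built on the linear-predictor scale should be back-transformed with exp(), not the inverse logit, and the fitted curve and interval can then be drawn on top of the data points.
# Sketch: back-transform a log-link CI with exp(), not exp(x)/(1+exp(x)),
# and overlay the fitted curve and interval on the raw data.
my_preds <- predict(model3, newdata = aids, se.fit = TRUE, type = "link")
upper <- exp(my_preds$fit + 1.96 * my_preds$se.fit)
lower <- exp(my_preds$fit - 1.96 * my_preds$se.fit)
plot(aids$date, aids$cases, xlab = 'Date', ylab = 'Cases', pch = 20)
lines(aids$date, exp(my_preds$fit), lwd = 2)
lines(aids$date, upper, lwd = 2, lty = 2)
lines(aids$date, lower, lwd = 2, lty = 2)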
Edit: response after the OP provided the full data set.
This started out as a question about plotting data and models on the same graph, but has morphed considerably. You seem to have an answer to the original question. Below is one way to address the rest.
Looking at your (and my) plots, it seems clear that a plain Poisson GLM is just not a good model. To put it differently, the number of cases may vary with date, but it is also influenced by other things not in your model (external regressors).
Plotting just your data suggests strongly that you have at least two and perhaps more regimes: time frames where the growth in cases follows different models.
ggplot(aids, aes(x=date)) + geom_point(aes(y=cases))
This suggests segmented regression. As with most things in R, there is a package for that (more than one, actually). The code below uses the segmented package to build a segmented Poisson GLM with 1 breakpoint (two regimes).
library(data.table)
library(ggplot2)
library(segmented)
setDT(aids) # convert aids to a data.table
aids[, pred :=
  predict(
    segmented(glm(cases ~ date, .SD, family = poisson), seg.Z = ~date, npsi = 1),
    type = 'response', se.fit = TRUE)$fit]
ggplot(aids, aes(x=date))+ geom_line(aes(y=pred))+ geom_point(aes(y=cases))
Note that we need to tell segmented the number of breakpoints, but not where they are; the algorithm figures that out for you. So here we see a regime prior to 3Q87 which is well modeled by a Poisson GLM, and a regime after that which is not. This is a fancy way of saying that "something happened" around 3Q87 which changed the course of the disease (at least in these data).
The code below does the same thing but for between 1 and 4 breakpoints.
get.pred <- \(p.n, p.DT) {
  fit <- glm(cases ~ date, p.DT, family = poisson)
  seg.fit <- segmented(fit, seg.Z = ~date, npsi = p.n)
  predict(seg.fit, type = 'response', se.fit = TRUE)[c('fit', 'se.fit')]
}
gg.dt <- rbindlist(lapply(1:4, \(x) { copy(aids)[, c('pred', 'se'):=get.pred(x, .SD)][, npsi:=x] } ))
ggplot(gg.dt, aes(x = date)) +
  geom_ribbon(aes(ymin = pred - 1.96*se, ymax = pred + 1.96*se), fill = 'grey80') +
  geom_line(aes(y = pred)) +
  geom_point(aes(y = cases)) +
  facet_wrap(~npsi)
Note that the location of the first breakpoint does not seem to change, and also that, notwithstanding the use of the Poisson GLM, the growth appears linear in all but the first regime.
There are goodness-of-fit metrics described in the package documentation which can help you decide how many break points are most consistent with your data.
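For example, a minimal sketch (assuming an information criterion is an acceptable yardstick here) that refits the candidate models and tabulates AIC and BIC for 1 to 4 breakpoints:
# Refit the segmented Poisson GLM for each breakpoint count and compare
# information criteria; lower values indicate a better fit/complexity trade-off.
fits <- lapply(1:4, \(n) segmented(glm(cases ~ date, data = aids, family = poisson),
                                   seg.Z = ~date, npsi = n))
data.frame(npsi = 1:4, AIC = sapply(fits, AIC), BIC = sapply(fits, BIC))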
Finally, there is also the mcp package which is a bit more powerful but also a bit more complex to use.
Original Response: Here is one way that builds the model predictions and std. error in a data.table, then plots using ggplot.
library(data.table)
library(ggplot2)
setDT(aids) # convert aids to a data.table
aids[, c('pred', 'se', 'resid.scale'):=predict(glm(cases~date, data=.SD, family=poisson), type='response', se.fit=TRUE)]
ggplot(aids, aes(x = date)) +
  geom_ribbon(aes(ymin = pred - 1.96*se, ymax = pred + 1.96*se), fill = 'grey80') +
  geom_line(aes(y = pred)) +
  geom_point(aes(y = cases))
Or, you could let ggplot do all the work for you.
ggplot(aids, aes(x = date, y = cases)) +
  stat_smooth(method = glm, method.args = list(family = poisson)) +
  geom_point()

R: Names of features not appearing when using table()

I am using the following command that returns this output:
> table(data$Smoke, data$Gender)
      female male
  no     314  334
  yes     44   33
Nonetheless, in the tutorial I'm watching, the instructor uses the same line of code and they get
       Gender
Smoke   female male
  no       314  334
  yes       44   33
How can I achieve this result? It's not clear from the help menu.
Just pass a two-column data.frame object to table()
table(data[c("Smoke", "Gender")])
#        Gender
# Smoke   female male
#   no        29   31
#   yes       17   23
or use xtabs():
xtabs( ~ Smoke + Gender, data)
#        Gender
# Smoke   female male
#   no        29   31
#   yes       17   23
Although the following also works, it looks somewhat crude.
table(Smoke = data$Smoke, Gender = data$Gender)
Data
data <- data.frame(id = 1:100,
                   Smoke = sample(c("no", "yes"), 100, T),
                   Gender = sample(c("female", "male"), 100, T))
You can name the vectors you pass to table.
table(Smoke = c('no','yes'), Gender = c('male','female'))
#-----
       Gender
Smoke   female male
  no         0    1
  yes        1    0
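Another option, not shown above but using the documented dnn argument of base table(), is to set the dimension names directly in the call:
# dnn supplies the dimnames, so the variable names appear as in the tutorial output.
table(data$Smoke, data$Gender, dnn = c("Smoke", "Gender"))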

MICE package in R: passive imputation

I aim to handle missing values with multiple imputation and then analyse the data with a linear mixed model.
I am stuck on passive imputation for "BMI" (body mass index) and "BMI category". "BMI" is calculated from height and weight and then categorized into "BMI category".
How do I impute 'BMI category'?
The data look like this:
sub_eu_surf[1:5, 3:12]
age gender smoking exercise education sbp dbp height weight bmi
1 41 1 1 2 18 120 80 185 107 31.26370
2 46 1 3 2 18 130 70 182 102 30.79338
3 46 1 3 2 18 130 70 182 102 30.79338
4 47 1 1 2 14 130 80 178 78 24.61810
5 47 1 1 1 14 150 80 175 85 27.75510
Since 'bmi category' is not a predictor in my imputation, I decided to create it after imputation. The details are below:
1. Define the method and predictor matrix
ini<-mice(sub_eu_surf, maxit=0)
meth<-ini$meth
meth["bmi"]<-"~I(weight/(height/100)^2)"
pred <- ini$predictorMatrix
pred[c("pm25_global", "pm25_eu", "pm10_eu", "no2_eu"), ]<-0
pred[,c("bmi", "hba1c", "pm25_eu", "pm10_eu")]<-0
pred[,"tc"]<-0
pred[c("smoking", "exercise", "hdl", "glucose"), "tc"]<-1
pred[c("smoking", "exercise", "hdl", "glucose"), "ldl"]<-0
vis <- ini$vis
imp_eu<-mice(sub_eu_surf, meth=meth, pred=pred, vis=vis, seed=200, print=F, m=5, maxit=5)
long_eu<- complete(imp_eu, "long", include=TRUE)
long_eu$bmi_category<-cut(as.numeric(long_eu$bmi), breaks=c(0, 18.5, 25, 30, 72))
complete_eu<-as.mids(long_eu)
But I received an error when analyzing my data:
test1<-with(imp_eu, lme(sbp~pm25_global+gender+age+education+bmi_category, random=~1|centre))
Error in eval(expr, envir, enclos) : object 'bmi_category' not found
Why does this happen?
You are running your analyses on the original mids object imp_eu, not on the modified complete_eu. Try:
test1<-with(complete_eu, lme(sbp~pm25_global+gender+age+education+bmi_category, random=~1|centre))
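A minimal sketch of the corrected workflow, assuming nlme::lme is the intended mixed-model function and that pooled estimates are wanted at the end (pooling lme fits may additionally require the broom.mixed package):
library(mice)
library(nlme)
# Fit the mixed model on each imputed dataset stored in complete_eu,
# then combine the results across imputations with Rubin's rules.
test1 <- with(complete_eu, lme(sbp ~ pm25_global + gender + age + education + bmi_category,
                               random = ~1 | centre))
summary(pool(test1))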

Plotting estimated probabilities from binary logistic regression when one or more predictor variables are held constant

I am a biology grad student who has been spinning my wheels for about thirty hours on the following issue. In summary, I would like to plot estimated probabilities from a binary logistic regression (GLM) model I produced. I have already gone through model selection, validation, etc., and am now simply trying to produce figures. I had no problem plotting probability curves for the model I selected, but what I am really interested in is a figure that shows the probability of a binary outcome across one predictor variable while the other predictor variable is held constant.
I cannot figure out how to assign a constant value to only one of the predictor variables and plot the probability over the other. Ultimately I would like to produce figures similar to the crude example I attached (desired output). I admit I am a novice in R and I certainly appreciate folks' time, but I have exhausted online searches and have yet to find the approach or a solution adequately explained. This is the closest information related to my question, but I found the explanation vague and it did not provide an example of holding one predictor constant while plotting the probability over the other: https://stat.ethz.ch/pipermail/r-help/2010-September/253899.html
Below I provide a simulated dataset and my progress. Thank you very much for your expertise; I believe a solution and code example would be helpful for other ecologists who use logistic regression.
The simulated dataset shows survival outcomes over the winter for lizards. The predictor variables are "mass" and "depth".
x<-read.csv('logreg_example_data.csv',header = T)
x
survival mass depth
1 0 4.294456 262
2 0 8.359857 261
3 0 10.740580 257
4 0 10.740580 257
5 0 6.384678 257
6 0 6.384678 257
7 0 11.596380 270
8 0 11.596380 270
9 0 4.294456 262
10 0 4.294456 262
11 0 8.359857 261
12 0 8.359857 261
13 0 8.359857 261
14 0 7.920406 258
15 0 7.920406 258
16 0 7.920406 261
17 0 10.740580 257
18 0 10.740580 258
19 0 38.824960 262
20 0 9.916840 239
21 1 6.384678 257
22 1 6.384678 257
23 1 11.596380 270
24 1 11.596380 270
25 1 11.596380 270
26 1 23.709520 288
27 1 23.709520 288
28 1 23.709520 288
29 1 38.568970 262
30 1 38.568970 262
31 1 6.581013 295
32 1 6.581013 298
33 1 0.766564 269
34 1 5.440803 262
35 1 5.440803 262
36 1 19.534710 252
37 1 19.534710 259
38 1 8.359857 263
39 1 10.740580 257
40 1 38.824960 264
41 1 38.824960 264
42 1 41.556970 239
#Dataset name is x
#Time to run the glm model
model1<-glm(formula=survival ~ mass + depth, family = "binomial", data=x)
model1
summary(model1)
#OK, now here's how I predict the probability of a lizard "Bob" surviving the winter
#with a mass of 32.949 grams and a burrow depth of 264 mm
newdata<-data.frame(mass = 32.949, depth = 264)
predict(model1, newdata, type = "response")
#The lizard "Bob" has an 87.3% chance of surviving the winter
#Now let's assume the glm model was robust and the lizard was endangered.
#From all my research I know the average burrow depth is 263.9 mm at a national park.
#Let's say I am also interested in survival probabilities at burrow depths of 200 and 100 mm, respectively.
#How do I use the glm model produced above to generate a plot showing the probability of
#lizards surviving, with burrow depth held at the values stated above, across a range of
#mass values from 0.0 to 100.0 grams?
#I know I need to use the plot and predict functions, but I cannot figure out how to tell R that I
#want to use the glm model I produced to predict "survival" based on "mass" when the other
#predictor "depth" is held at constant values of biological relevance.
#I would also like to add dashed lines for the 95% CI.
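One way to do this, sketched with depth held at the average of 263.9 mm (the same pattern works for depths of 200 and 100 mm): build a prediction grid over mass, predict on the link scale so the interval can be formed there, and back-transform with the inverse logit.
# Hold depth constant and predict survival probability over a grid of mass values,
# with an approximate 95% CI built on the link (log-odds) scale.
newdat <- data.frame(mass = seq(0, 100, length.out = 200), depth = 263.9)
pr  <- predict(model1, newdata = newdat, type = "link", se.fit = TRUE)
fit <- plogis(pr$fit)                       # back-transform log-odds to probabilities
upr <- plogis(pr$fit + 1.96 * pr$se.fit)
lwr <- plogis(pr$fit - 1.96 * pr$se.fit)
plot(newdat$mass, fit, type = "l", ylim = c(0, 1),
     xlab = "Mass (g)", ylab = "Probability of survival")
lines(newdat$mass, upr, lty = 2)            # dashed 95% CI limits
lines(newdat$mass, lwr, lty = 2)
# Repeat with depth = 200 or depth = 100 in newdat to add curves for other depths.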

Logistic regression with NAs and factors returns error

I ran into the following two major problems when running a logistic regression:
My X variables include factor variables, such as immigrant status (immigrant, non-immigrant); my Y variable is a binomial variable, low birth weight (non-lbw, lbw).
I run the following R script (I am using the plsRglm package):
library(plsRglm)
model.plsrglm <- plsRglm(yair, xair, 3, modele="pls-glm-logistic")
1) If I do not drop all the NA values in y or x, R returns this:
summary(model.plsrglm)
Call
plsRglmmodel.default(dataY = yair, dataX = xair, nt = 6,
modele = "pls-glm-logistic")
> model.plsrglm
Number of required components:
NULL
Number of successfully computed components:
NULL
Coefficients:
NULL
Information criteria and Fit statistics:
NULL
2) If I do drop all the NA values before running the model, R gives an error:
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
So should I drop all NA values before fitting the model?
And should I convert the factor variables to numeric? If so, how should I do that, just by using as.numeric? But wouldn't that imply an ordering between non-immigrant and immigrant?
And for the Y variable, should I recode it as 0 and 1?
I added a reproducible dataset as below.
outcome c1 c2 c3 c4
1 lbw 120 yes <30 good
2 lbw 124 yes <30 good
3 lbw 125 yes <30 good
4 lbw 135 yes <30 good
5 lbw 112 yes <30 good
6 lbw 168 yes <30 good
7 lbw 147 yes 30-40 good
8 lbw 174 yes 30-40 fair
9 lbw 153 yes 30-40 fair
10 lbw 145 yes 30-40 fair
11 lbw 145 yes 30-40 fair
12 lbw 125 no >40 fair
13 lbw 125 no >40 poor
14 lbw 111 no >40 poor
15 non-lbw 80 no >40 poor
16 non-lbw 85 no >40 poor
17 non-lbw 78 yes >40 poor
18 non-lbw 67 no >40 poor
xair <- bc1997[,c("c1","c2","c3","c4")]
yair <- bc1997[,"outcome"]
model.plsrglm <- plsRglm(yair, xair, 2, modele="pls-glm-logistic")
summary(model.plsrglm)
But I got this error:
> model.plsrglm <- plsRglm(yair, xair, 2, modele="pls-glm-logistic")
____************************************************____
Family: binomial
Link function: logit
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Your 'x' terms must be numeric, but your variables "c2", "c3", and "c4" are of class character or factor.
The default setting for scaleX is TRUE, which means colMeans() is used to scale your predictors. However, this is not possible with factors. Therefore, you can either convert each column to numeric or specify scaleX=FALSE.
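A hedged sketch of the first option, reusing the bc1997 data frame from the question: model.matrix() expands the factor columns into numeric dummy variables, and the outcome can be recoded as 0/1 at the same time.
library(plsRglm)
# Expand the factor predictors into numeric dummy columns (dropping the intercept)
# and recode the outcome as 0/1 before calling plsRglm().
xair_num <- model.matrix(~ c1 + c2 + c3 + c4, data = bc1997)[, -1]
yair_bin <- as.numeric(bc1997$outcome == "lbw")
model.plsrglm <- plsRglm(yair_bin, xair_num, 2, modele = "pls-glm-logistic")
summary(model.plsrglm)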

Resources