Confusion matrix for multinomial logistic regression & ordered logit - r

I would like to create confusion matrices for a multinomial logistic regression as well as a proportional odds model but I am stuck with the implementation in R. My attempt below does not seem to give the desired output.
This is my code so far:
CH <- read.table("http://data.princeton.edu/wws509/datasets/copen.dat", header=TRUE)
CH$housing <- factor(CH$housing)
CH$influence <- factor(CH$influence)
CH$satisfaction <- factor(CH$satisfaction)
CH$contact <- factor(CH$contact)
CH$satisfaction <- factor(CH$satisfaction,levels=c("low","medium","high"))
CH$housing <- factor(CH$housing,levels=c("tower","apartments","atrium","terraced"))
CH$influence <- factor(CH$influence,levels=c("low","medium","high"))
CH$contact <- relevel(CH$contact,ref=2)
library(nnet)  # provides multinom()
model <- multinom(satisfaction ~ housing + influence + contact, weights=n, data=CH)
summary(model)
preds <- predict(model)
table(preds,CH$satisfaction)
library(MASS)  # provides polr()
omodel <- polr(satisfaction ~ housing + influence + contact, weights=n, data=CH, Hess=TRUE)
preds2 <- predict(omodel)
table(preds2,CH$satisfaction)
I would really appreciate some advice on how to correctly produce confusion matrices for my two models!

You can refer to: Predict() - Maybe I'm not understanding it
In predict() you need to pass the unseen data via the newdata argument; called without it, predict() simply returns the fitted predictions for the training data.
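As a concrete sketch of that workflow (using the built-in iris data rather than the Copenhagen housing data), a confusion matrix on held-out data can be produced like this:

```r
library(nnet)  # recommended package shipped with R; provides multinom()

set.seed(1)
idx   <- sample(nrow(iris), 100)       # 100 training rows, 50 held out
train <- iris[idx, ]
test  <- iris[-idx, ]

fit   <- multinom(Species ~ ., data = train, trace = FALSE)
preds <- predict(fit, newdata = test)  # pass the unseen data explicitly

table(Predicted = preds, Actual = test$Species)
```

One more note on the original data: because copen.dat is grouped and the model is fit with weights = n, table(preds, CH$satisfaction) counts covariate patterns rather than households; adding the predictions as a column and tabulating with xtabs(n ~ pred + satisfaction, data = CH) weights each cell by the group counts.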

Related

Reproduce Stacking Regression models coefficients

I built two datasets based on the following equations.
spend1 <- runif(100,5,40)
spend2 <- runif(100,30,45)
rev1 <- 2*spend1 + 3*spend2
total_spend<- spend1 + spend2
rev2 <- runif(100,75,120)
total_rev<- rev1 + rev2
After generating these variables, I built a regression model and tried to recover the original coefficients from it.
model1 <- lm(total_rev~total_spend + rev2)
yhat <- total_spend*as.vector(model1$coefficients)[2] + model1$residuals
model2 <- lm(yhat~spend1+spend2)
summary(model2)
Why does adding model1$residuals help recover the correct coefficients for spend1 and spend2?
If I don't include it, the spend1 and spend2 coefficients both come out as just the total_spend coefficient from model1.
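For what it's worth, a reproducible sketch of this setup (with a seed added, so the exact numbers differ from the original post) shows what is going on: model1 absorbs rev2 with a coefficient b2 close to 1, so yhat = b1*total_spend + residuals equals total_rev - b0 - b2*rev2, which is rev1 shifted by (almost) a constant. Regressing that on spend1 and spend2 therefore recovers coefficients close to 2 and 3:

```r
set.seed(123)
spend1 <- runif(100, 5, 40)
spend2 <- runif(100, 30, 45)
rev1   <- 2 * spend1 + 3 * spend2
total_spend <- spend1 + spend2
rev2   <- runif(100, 75, 120)
total_rev   <- rev1 + rev2

model1 <- lm(total_rev ~ total_spend + rev2)
yhat   <- total_spend * coef(model1)["total_spend"] + model1$residuals

# yhat = total_rev - b0 - b2*rev2 = rev1 + (1 - b2)*rev2 - b0,
# and b2 is close to 1, so yhat is rev1 up to (nearly) a constant shift
model2 <- lm(yhat ~ spend1 + spend2)
coef(model2)  # spend1 and spend2 coefficients land near 2 and 3
```

Dropping the residuals keeps only the b1*total_spend part, which by construction carries the same coefficient for spend1 and spend2.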

Why no t-scores or p-values from summary(glm) in Databricks?

I'm using Databricks with the SparkR package to build a glm model. Everything seems to run OK except summary(lm1): instead of Variable, Estimate, Std. Error, t-value and p-value (which is what I'd expect to see), I only get the variable and estimate. The only explanation I can think of is that the data set is big enough (train1 is 12 million rows and test1 is 6 million rows) that all estimates have 0 p-values. Any other reason this could happen?
library(SparkR)
rdf <- sql("select * from myTable") #read data
train1 <- rdf[rdf$ntile_3 != 1,] # split into test and train based on ntile in table
test1 <- rdf[rdf$ntile_3 == 1,]
vtu1 <- c('var1','var2','var3')
lm1 <- glm( target ~., train1[,c(vtu1,'target' )],family = 'gaussian')
pred1 <- predict(lm1, test1)
summary(lm1)
As you specify family = 'gaussian' in your model, your glm is equivalent to a standard linear regression model (as fit by lm in R).
For an extensive answer to your question, see for example here: https://stats.stackexchange.com/questions/187100/interpreting-glm-model-output-assessing-quality-of-fit
If you specify your model using lm, you should get the output you want.
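Outside Spark, base R's summary.lm does report the full coefficient table; a minimal illustration on a built-in dataset (not the Databricks data):

```r
# base R: summary() of an lm fit reports Estimate, Std. Error, t value, Pr(>|t|)
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef_table <- coef(summary(fit))
colnames(coef_table)  # "Estimate" "Std. Error" "t value" "Pr(>|t|)"
```

In SparkR itself, whether standard errors appear can depend on the solver used for the fit, so collecting a manageable sample to the driver and refitting with lm is a common sanity check.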

Validating a model and introducing a new predictor in glm

I am hitting my head against the computer...
I have a prediction model in R that goes like this
m.final.glm <- glm(binary_outcome ~ rcs(PredictorA, parms=kn.a) + rcs(PredictorB, parms=kn.b) + PredictorC , family = "binomial", data = train_data)
I want to validate this model on test_data2 - first by updating the linear predictor (lp)
train_data$lp <- predict(m.final.glm, train_data)
test_data2$lp <- predict(m.final.glm, test_data2)
lp2 <- predict(m.final.glm, test_data2)
m.update2.lp <- glm(binary_outcome ~ 1, family="binomial", offset=lp2, data=test_data2)
m.update2.lp$coefficients[1]
m.final.update2.lp <- m.final.glm
m.final.update2.lp$coefficients[1] <- m.final.update2.lp$coefficients[1] + m.update2.lp$coefficients[1]
m.final.update2.lp$coefficients[1]
p2.update.lp <- predict(m.final.update2.lp, test_data2, type="response")
This gets me to the point where I have updated the linear predictor, i.e. in the summary of the model only the intercept is different, but the coefficients of each predictor are the same.
Next, I want to include a new predictor (it is categorical, if that matters), PredictorD, into the updated model. That is, the updated model must keep the recalibrated intercept and the same coefficients for Predictors A, B and C, while also containing Predictor D and estimating its significance.
How do I do this? I will be very grateful if you could help me with this. Thanks!!!
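One standard way to do this (sketched here on simulated data, not the original variables) is to keep the frozen part of the model in an offset: the linear predictor from the original model carries the fixed coefficients for A, B and C, and a new glm estimates only the recalibrated intercept and the coefficient for the new predictor:

```r
set.seed(1)
n <- 500
A <- rnorm(n); B <- rnorm(n)
D <- factor(sample(c("no", "yes"), n, replace = TRUE))
y <- rbinom(n, 1, plogis(-0.5 + 0.8 * A - 0.6 * B + 1.2 * (D == "yes")))
dat <- data.frame(y, A, B, D)

fit_abc <- glm(y ~ A + B, family = binomial, data = dat)  # original model (no D)
dat$lp  <- predict(fit_abc, dat)                          # frozen linear predictor

# the intercept is re-estimated and D gets its own coefficient;
# the A/B coefficients stay fixed inside the offset
fit_d <- glm(y ~ D, family = binomial, offset = lp, data = dat)
summary(fit_d)  # reports the estimate and p-value for D
```

The intercept of fit_d is the recalibration term you computed by hand, so this reproduces your update step and adds PredictorD in one fit.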

issue with calculating precision and recall for SVM model in R

I use census data to build a logistic regression model and an SVM model. First, I convert <=50K to 0 and >50K to 1 to make the outcome binary. I try to calculate precision and recall for both models to compare which performs better, but for the SVM model table(test$salary, pred1 > 0.5) gives < table of extent 2 x 0 > and:
Warning message:
In Ops.factor(pred1, 0.5) : ‘>’ not meaningful for factors
but similar code works for the logistic regression model. I am new to R, so any help is welcome; I hope the question is clear enough.
#setwd("C:/Users/)
Censusdata <- read.csv(file="census-data.csv", header=TRUE, sep=",")
library("dplyr", lib.loc="~/R/win-library/3.4")
df <- Censusdata[,]
# convert <=50K to 0, >50K to 1
df$salary <- as.numeric(factor(df$salary))-1
head(df,10)
library(lattice)
library(ggplot2)
library(caret)
data <- Censusdata
indexes <- sample(1:nrow(data),size=0.7*nrow(data))
test <- data[indexes,]
train <- data[-indexes,]
#logistic regression model fit
model <- glm(salary ~ education.num + hours.per.week,family = binomial,data = test)
pred <- predict(model,data=train)
summary(model)
# calculate precision and recall
table(test$salary,pred >0.5)
# for SVM model
model1 <- svm(salary ~ education.num + hours.per.week,family = binomial, data=test)
pred1 <- predict(model1,data=train)
table(test$salary,pred1 >0.5)
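The warning usually means pred1 is a factor: model1 was fit on data (a copy of Censusdata, where salary is still a factor, unlike df), so svm() performed classification and predict() returned class labels, which cannot be compared with > 0.5. (Note also that predict() takes newdata =, not data =; with data = it silently returns predictions for the training set.) Once predictions are 0/1, precision and recall fall out of the confusion matrix; a base-R sketch on synthetic labels (not the census data):

```r
set.seed(42)
truth <- rbinom(200, 1, 0.4)                   # actual 0/1 salary labels
prob  <- ifelse(truth == 1, rbeta(200, 6, 2),  # toy predicted probabilities
                            rbeta(200, 2, 6))
pred  <- as.integer(prob > 0.5)                # threshold at 0.5

cm <- table(Actual = truth, Predicted = pred)
precision <- cm["1", "1"] / sum(cm[, "1"])  # TP / (TP + FP)
recall    <- cm["1", "1"] / sum(cm["1", ])  # TP / (TP + FN)
c(precision = precision, recall = recall)
```

For the SVM itself you can either tabulate the factor predictions directly, table(test$salary, pred1), or request probabilities at fit and predict time if you want to keep the 0.5 threshold.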

how to carry out logistic regression and random forest to predict churn rate

I am using following dataset: http://www.sgi.com/tech/mlc/db/churn.data
And the variable description: http://www.sgi.com/tech/mlc/db/churn.names
I did some preliminary coding, but I cannot work out how to apply logistic regression and random forest to this data to estimate variable importance and predict the churn rate.
nm <- read.csv("http://www.sgi.com/tech/mlc/db/churn.names",
skip=4, colClasses=c("character", "NULL"), header=FALSE, sep=":")[[1]]
nm
dat <- read.csv("http://www.sgi.com/tech/mlc/db/churn.data", header=FALSE, col.names=c(nm, "Churn"))
dat
View(dat)
library(survival)
s <- with(dat, Surv(account.length, as.numeric(Churn)))
model <- coxph(s ~ total.day.charge + number.customer.service.calls, data=dat[, -4])
summary(model)
plot(survfit(model))
I am also not able to figure out how to use the model I built in my further analysis.
Please help me.
Do you have any example code of what you're trying to do? What further analysis do you have planned? If you're just trying to run a logistic regression on the data, the general format is:
lr <- glm(Churn ~ international.plan + voice.mail.plan + number.vmail.messages
+ account.length, family = "binomial", data = dat)
Try help(glm) and help(randomForest)
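Since the randomForest package may not be installed everywhere, here is a base-R sketch (on simulated churn-like data, not the SGI file) of fitting a logistic regression and ranking predictors by the magnitude of their z statistics as a crude importance measure; because the model has an intercept, the mean fitted probability also matches the observed churn rate:

```r
set.seed(7)
n <- 400
day_charge    <- rnorm(n, mean = 30, sd = 8)
service_calls <- rpois(n, lambda = 1.5)
intl_plan     <- rbinom(n, 1, 0.1)
churn <- rbinom(n, 1, plogis(-4 + 0.08 * day_charge +
                             0.6 * service_calls + 1.5 * intl_plan))

fit <- glm(churn ~ day_charge + service_calls + intl_plan, family = binomial)
z   <- coef(summary(fit))[-1, "z value"]
sort(abs(z), decreasing = TRUE)  # crude variable-importance ranking

mean(fitted(fit))  # equals mean(churn): the model's overall churn rate
```

With randomForest installed, the analogous steps would be randomForest(Churn ~ ., data = dat, importance = TRUE) followed by importance() or varImpPlot().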
