Format of goodness-of-fit table in binomial regression in R

I ran a goodness-of-fit test for a binomial regression and got these results:
[goodness-of-fit result]
In the example my teacher gave, the table was ordered
row = 0 1
column = 0 1
while mine is
column = 1 0
row = 0 1
as seen in the image above.
Does this difference matter for the results I get?

The results won't change. But if you like, you can change the order of the columns using
table()[,2:1]
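For example (a minimal sketch with made-up data, since the original table is only shown in the image):

# toy predictions and outcomes, purely for illustration
pred   <- c(0, 1, 1, 0, 1, 1)
actual <- c(0, 1, 0, 0, 1, 1)
tab <- table(pred, actual)
tab            # columns appear in the default order 0, 1
tab[, 2:1]     # swap the column order so 1 comes before 0
tab[2:1, 2:1]  # swap both rows and columns

Either way the counts are identical; only the display order changes, so any goodness-of-fit statistic computed from the table is unaffected.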

Related

Testing and adjusting for autocorrelation / serial correlation

Unfortunately I'm not able to provide a reproducible example, but hopefully you get the idea regardless.
I am running some regression analyses where the dependent variable is the DCC of a pair of return series (two stocks). I'm using dummies to represent shocks in the return series, i.e. the worst 1% of observed returns. In sum:
DCC = c + 1%Dummy
When I run the DurbinWatsonTest I get the output:
Autocorrelation: 0.9987
D-W statistic: 0
p-value: 0
HA: rho != 0
Does this just mean that there is a highly significant amount of autocorrelation?
I also tried dwtest, but that yields NA values for both the p-value and the DW statistic.
To correct for autocorrelation I used the code:
library(lmtest)    # coeftest()
library(sandwich)  # vcovHC()

spx10    <- lm(bit_sp500 ~ Spx_0.1)
spx10_hc <- coeftest(spx10, vcov. = vcovHC(spx10, method = "arellano", type = "HC3"))
How can I be certain that it had any effect? I cannot run the DW test on spx10_hc, nor did the regression output change noticeably. Is it common for a regression with one independent variable to change only slightly when adjusting for autocorrelation?
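One way to check what the correction actually did (a sketch with simulated data; it uses the lmtest and sandwich packages and the Newey-West HAC estimator, not the exact vcovHC call from the question) is to print the classical and robust coefficient tables side by side. The point estimates are identical by construction; only the standard errors and p-values move:

library(lmtest)    # coeftest(), dwtest()
library(sandwich)  # NeweyWest()

# simulate a regression with strongly autocorrelated errors
set.seed(42)
x <- rnorm(200)
e <- as.numeric(arima.sim(list(ar = 0.9), n = 200))
y <- 1 + 0.5 * x + e
fit <- lm(y ~ x)

dwtest(fit)                            # Durbin-Watson test on the residuals
coeftest(fit)                          # classical (iid) standard errors
coeftest(fit, vcov. = NeweyWest(fit))  # HAC standard errors

So an unchanged coefficient column is expected: sandwich-type corrections only replace the variance-covariance matrix, not the fitted coefficients.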

Table function omits bottom row

I am trying to make a confusion matrix for a logistic regression model, and no matter what I do the table function leaves out the bottom row. To demonstrate my work, I can provide sample data below. My real data is very bulky and is stored on my computer; the problem is that this simulation, which is based on my exact code, works properly. The difference might be that this is a very small simulation, whereas my real data has 20,000+ rows.
set.seed(1)
a <- runif(10)
b <- runif(10)
c <- rnorm(10)
sample_outcome <- sample(c(0, 1), replace = TRUE, size = 10)
sample.df <- data.frame(a, b, c, sample_outcome)
s_logistic <- glm(formula = sample_outcome ~ ., data = sample.df, family = binomial)
s_probs <- predict(s_logistic, type = "response")
s_predict <- rep(0, nrow(sample.df))
s_predict[s_probs > .5] <- 1
table(s_predict, sample.df$sample_outcome)
s_predict 0 1
        0 1 0
        1 2 7
In my actual data, the bottom row, the one that corresponds to "1,2,7" is ALWAYS missing. Any idea what might be going on here?
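A likely cause (an assumption, since the real data isn't shown) is that on the full data the model never predicts a 1 at the 0.5 cutoff, so s_predict contains only zeros and table() has nothing to put in the second row. Converting both vectors to factors with the levels spelled out forces the zero-count row to appear:

# force both levels (0 and 1) into the confusion matrix,
# even if one of them is never predicted
s_predict_f <- factor(s_predict, levels = c(0, 1))
actual_f    <- factor(sample.df$sample_outcome, levels = c(0, 1))
table(s_predict_f, actual_f)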

credit score from SVM probabilities in R

I am trying to calculate credit scores for the germancredit data frame in R. I used a linear SVM classifier to predict 0 and 1 (i.e. 0 = good, 1 = bad).
I managed to produce probabilities from the SVM classifier using the following code:
final_pred = predict(classifier, newdata = data_treated[1:npredictors], decision.values = TRUE, probability = TRUE)
probs = attr(final_pred,"probabilities")
I want to know how to read this probability output. A sample of the output is shown below.
Does the following output mean that, if the prediction is 1 (default) in the fifth row, then the probability of default is 0.53601166?
           0          1 Prediction
1 0.90312125 0.09687875          0
2 0.57899408 0.42100592          0
3 0.93079172 0.06920828          0
4 0.76600082 0.23399918          0
5 0.46398834 0.53601166          1
Can I then use these probabilities to develop a credit scorecard, as is usually done with a logistic regression model?
You get a probability for each outcome, 0 or 1. The first two columns in each row sum to one and give you the overall probability. Your interpretation seems correct to me, i.e. with a probability of 0.536 a default is more likely to happen than no default (p = 0.464).
Yes, you could use that model to develop a credit scorecard. Please keep in mind that you don't necessarily need to use 0.5 as your cutoff value for deciding whether company or person X is going to default.
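As a rough illustration of the scorecard step (a sketch only; the base score of 600, base odds of 50:1 and PDO of 20 are arbitrary example calibration choices, not anything taken from the question), the predicted probabilities can be mapped to points through the log-odds, exactly as is usually done with a logistic model:

# map predicted default probabilities to scorecard points via log-odds
# base_score, base_odds and pdo are example calibration parameters
prob_to_score <- function(p_default, base_score = 600, base_odds = 50, pdo = 20) {
  odds_good <- (1 - p_default) / p_default   # odds of being "good"
  fct <- pdo / log(2)
  offset <- base_score - fct * log(base_odds)
  offset + fct * log(odds_good)
}

prob_to_score(0.53601166)  # e.g. the fifth row of the output above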

Handling error in function, notify then skip

I am having trouble with a function that should both run regression analyses and write the regression summaries to a csv file. The setup looks like this:
I have three predicting variables:
age1 (continuous), gender1 (categorical 0/1), FLUSHOT(categorical 0/1)
In the file, the first 100 columns are the response variables (all categorical 0/1) that I want to test.
The goal is to run a regression analysis for each of the response variables (1:100) and output only the p-value, OR, and CI.
So the code I have looks something like this:
fun1 <- function(x) {
  # collect the model call, p-value, odds ratio and confidence interval
  # for the FLUSHOT coefficient (row 4 of the coefficient table)
  res <- c(paste(as.character(summary(x)$call), collapse = " "),
           summary(x)$coefficients[4, 4],
           exp(coef(x))[4],
           exp(confint(x))[4, 1:2], "\n")
  names(res) <- c("call", "p-value", "OR", "LCI", "UCI", "")
  return(res)
}

res2 <- NULL
lms  <- list()
for (i in 1:100) {
  lms[[i]] <- glm(A[, i] ~ age1 + gender1 + as.factor(FLUSHOT),
                  family = "binomial", data = A)
  res2 <- rbind(res2, fun1(lms[[i]]))
}
write.csv(res2, "A_attempt1.csv", row.names = FALSE)
If, for example, we have a sufficient sample size in each category, i.e. the marginal frequencies look like this:
table(variable1, FLUSHOT)

          FLUSHOT
variable1  0  1
        0 15  3
        1 11 19

this code works well. But if we have something like:
table(variable15, FLUSHOT)

           FLUSHOT
variable15  0  1
         0 15  0
         1 11 19

the code runs into an error, reports it, and stops.
I tried multiple ways of using try() and tryCatch(), but they didn't seem to work for me.
What error message do you see? You can try using lrm from the rms package to estimate the logistic regression model, and texreg to write the output to csv.
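For the skip-and-notify part specifically, one pattern (a sketch that reuses the asker's fun1, A and variable names; the placeholder row is just one way to mark a skipped model) is to wrap the model fit in tryCatch inside the loop, so a failing response variable is reported and replaced by NAs instead of stopping the run:

res2 <- NULL
lms  <- list()
for (i in 1:100) {
  res2 <- rbind(res2, tryCatch({
    lms[[i]] <- glm(A[, i] ~ age1 + gender1 + as.factor(FLUSHOT),
                    family = "binomial", data = A)
    fun1(lms[[i]])
  }, error = function(e) {
    message("Skipping response ", i, ": ", conditionMessage(e))  # notify
    c(paste("response", i), NA, NA, NA, NA, "\n")                # placeholder row
  }))
}
write.csv(res2, "A_attempt1.csv", row.names = FALSE)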

Classification table for logistic regression in R

I have a data set consisting of a dichotomous dependent variable (Y) and 12 independent variables (X1 to X12) stored in a csv file. Here are the first 5 rows of the data:
Y,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12
0,9,3.86,111,126,14,13,1,7,7,0,M,46-50
1,7074,3.88,232,4654,143,349,2,27,18,6,M,25-30
1,5120,27.45,97,2924,298,324,3,56,21,0,M,31-35
1,18656,79.32,408,1648,303,8730,286,294,62,28,M,25-30
0,3869,21.23,260,2164,550,320,3,42,203,3,F,18-24
I constructed a logistic regression model from the data using the following code:
mydata <- read.csv("data.csv")
mylogit <- glm(Y~X1+X2+X3+X4+X5+X6+X7+X8+X9+X10+X11+X12, data=mydata,
family="binomial")
mysteps <- step(mylogit, Y~X1+X2+X3+X4+X5+X6+X7+X8+X9+X10+X11+X12, data=mydata,
family="binomial")
I can obtain the predicted probabilities for each data using the code:
theProbs <- fitted(mysteps)
Now, I would like to create a classification table, using the first 20 rows of the data table (mydata), from which I can determine the percentage of the predicted probabilities that actually agree with the data. Note that for the dependent variable (Y), 0 represents a probability of less than 0.5 and 1 represents a probability greater than 0.5.
I have spent many hours trying to construct the classification table without success. I would appreciate it very much if someone could suggest code that helps solve this problem.
The question is a bit old, but I figure that if someone is looking through the archives, this may help.
This is easily done with xtabs:
classDF <- data.frame(response = mydata$Y, predicted = round(fitted(mysteps),0))
xtabs(~ predicted + response, data = classDF)
which will produce a table like this:
         response
predicted   0   1
        0 339 126
        1 130 394
I think 'round' can do the job here.
table(round(theProbs))
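To get the percentage of agreement for just the first 20 rows, as asked (a minimal sketch reusing the mydata and mysteps objects from the question and the implied 0.5 cutoff):

# classification table and agreement rate for the first 20 rows only
pred20   <- round(fitted(mysteps)[1:20])   # predicted class at a 0.5 cutoff
actual20 <- mydata$Y[1:20]
table(predicted = pred20, response = actual20)
mean(pred20 == actual20) * 100             # percentage of agreement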
