credit score from SVM probabilities in R - r

I am trying to calculate credit scores for the germancredit dataframe in R. I used a linear SVM classifier to predict 0 and 1 (i.e. 0 = good, 1 = bad).
I managed to produce probabilities from the SVM classifier using the following code:
final_pred = predict(classifier, newdata = data_treated[1:npredictors], decision.values = TRUE, probability = TRUE)
probs = attr(final_pred, "probabilities")
I want to know how to read this probability output. A sample of the output is below.
Does the following output mean that, if the prediction is 1 (default) in the fifth row, the probability of default is 0.53601166?
0 1 Prediction
1 0.90312125 0.09687875 0
2 0.57899408 0.42100592 0
3 0.93079172 0.06920828 0
4 0.76600082 0.23399918 0
5 0.46398834 0.53601166 1
Can I then use these probabilities to develop a credit scorecard, as one usually does with a logistic regression model?

You get a probability for each outcome, 0 and 1. The first two columns in each row sum to one. Your interpretation seems correct to me: with a probability of 0.536 a default is more likely than no default (p = 0.464).
Yes, you could use that model for developing a credit scorecard. Bear in mind that you don't necessarily need to use 0.5 as your cutoff value for deciding whether company or person X is going to default.
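To turn the class-1 (default) probabilities into scorecard points, you can apply the usual log-odds scaling. A minimal sketch follows; the function name and the base score of 600, base odds of 50:1 and PDO of 20 are all illustrative assumptions, not anything from the question:

```r
# Illustrative sketch: map a probability of default to scorecard points
# via points-to-double-the-odds (PDO) scaling. All scaling constants are
# arbitrary choices for demonstration.
prob_to_score <- function(p_bad, base_score = 600, base_odds = 50, pdo = 20) {
  factor <- pdo / log(2)                       # points added per doubling of odds
  offset <- base_score - factor * log(base_odds)
  odds_good <- (1 - p_bad) / p_bad             # odds of being "good"
  offset + factor * log(odds_good)
}

# Scores for the first and fifth rows of the sample output above
prob_to_score(c(0.09687875, 0.53601166))
```

Higher class-1 probability gives a lower score, matching the usual scorecard convention.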

Testing and adjusting for autocorrelation / serial correlation

Unfortunately I'm not able to provide a reproducible example, but hopefully you get the idea regardless.
I am conducting some regression analyses where the dependent variable is the DCC of a pair of return series (two stocks). I'm using dummies to represent shocks in the return series, i.e. the worst 1% of observed returns. In sum:
DCC = c + 1%Dummy
When I run the DurbinWatsonTest I get the output:
Autocorrelation: 0.9987
D-W statistic: 0
p-value: 0
HA: rho !=0
Does this just mean that there is highly significant autocorrelation?
I also tried dwtest, but that yields NA values for both the p-value and the D-W statistic.
To correct for autocorrelation I used the code:
library(lmtest)
library(sandwich)
spx10 = lm(bit_sp500 ~ Spx_0.1)
spx10_hc = coeftest(spx10, vcov. = vcovHC(spx10, method = "arellano", type = "HC3"))
How can I be certain that it had any effect, given that I cannot run the D-W test on spx10_hc, nor did the regression output change noticeably? Is it common that a regression with one independent variable changes only ever so slightly when adjusting for autocorrelation?
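For what it's worth, the pattern is easy to reproduce on simulated data: a robust-covariance correction changes only the standard errors, never the coefficient estimates, which is why the regression output barely moves. A self-contained sketch (all data simulated, names made up; Newey-West is used here as a generic autocorrelation-robust estimator, not necessarily the one in the question):

```r
# Simulated sketch: HAC (Newey-West) standard errors adjust inference,
# not the coefficient estimates themselves.
library(lmtest)
library(sandwich)

set.seed(123)
x <- rnorm(200)
u <- as.numeric(arima.sim(list(ar = 0.9), n = 200))  # strongly autocorrelated errors
y <- 1 + 2 * x + u
fit <- lm(y ~ x)

coeftest(fit)                          # plain (iid) standard errors
coeftest(fit, vcov. = NeweyWest(fit))  # autocorrelation-robust standard errors
```

Comparing the two outputs, the estimates are identical; only the standard errors, t-values and p-values differ.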

How do you organize data for and run multinomial probit in R?

I apologize for the "how do I run this model in R" question. I will be the first to admit that I am a newbie when it comes to statistical models. Hopefully I have enough substantive questions surrounding it to be interesting, and the question will come out more like, "Does this command in R correspond to this statistical model?"
I am trying to estimate a model of the probability that a given Twitter user "follows" a political user from a given political party. My dataframe is at the level of individual users, where each user can choose to follow or not follow a party on Twitter. As alternative-specific variables I have measures of ideological distance between the Twitter user and the political party, and an interaction term that specifies whether the distance is positive or negative. Thus, the decision to follow a politician on Twitter is a function of ideological distance.
Initially I tried to estimate a conditional logit model, but I quickly got away from that idea since the choices are not mutually exclusive, i.e. a user can choose to follow more than one party. Now I am in doubt whether I should employ a multinomial probit or a multivariate probit, since I want my model to allow individuals to choose more than one alternative. However, when I try to estimate a multinomial probit, my code doesn't work. My code is:
mprobit <- mlogit(Follow ~ F1_Distance + F2_Distance + F1_Distance*F1_interaction + F2_Distance*F2_interaction + strata(id),
                  long, probit = T, seed = 123)
And I get the following error message:
Error in dfidx::dfidx(data = data, dfa$idx, drop.index = dfa$drop.index, :
the two indexes don't define unique observations
I've tried looking the error up, but I can't seem to find anything that relates to probit models. Can you tell me what I'm doing wrong? Once again, sorry for my ignorance. Thank you for your help.
Also, I've copied part of my dataframe below. The data shows the first 6 observations for the first Twitter user, but I have a dataset of 5181 users, which corresponds to 51810 observations, since there are 10 parties in Denmark.
  id     Alternative Follow F1_Distance F2_Distance F1_interaction F2_interaction
1  1    alternativet      1  -0.9672566  -1.3101138              0              0
2  1 danskfolkeparti      0   0.6038972   1.3799961              1              1
3  1    konservative      1   1.0759252   0.8665096              1              1
4  1    enhedslisten      0  -1.0831657  -1.0815424              0              0
5  1 liberalalliance      0   1.5389934   0.8470291              1              1
6  1   nyeborgerlige      1   1.4139934   0.9898862              1              1
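The error message usually means the two index columns contain duplicated (chooser, alternative) pairs. As a hedged sketch of the standard mlogit indexing setup (assuming the column names shown above; note that strata() belongs to survival::clogit and is not part of mlogit's formula interface), one would write something like:

```r
# Hypothetical sketch: mlogit (via dfidx) requires each (id, Alternative)
# pair to occur exactly once; duplicated pairs trigger the
# "the two indexes don't define unique observations" error.
library(mlogit)

long_idx <- dfidx(long, idx = c("id", "Alternative"))
mprobit <- mlogit(Follow ~ F1_Distance + F2_Distance +
                    F1_Distance:F1_interaction + F2_Distance:F2_interaction,
                  data = long_idx, probit = TRUE, seed = 123)
```

Whether a multinomial probit is the right model here is a separate issue, since the follow decisions are not mutually exclusive.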

format of goodness of fit table in binomial regression in r

I did a goodness-of-fit test for a binomial regression and got these results:
[image: goodness of fit result]
In the example my teacher gave, the table was
row = 0 1
column = 0 1
while mine is
column = 1 0
row = 0 1
as seen in the image above.
Does this difference matter for the results I get?
The results won't change. But if you like, you can reverse the order of the columns using
table()[, 2:1]
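As a toy illustration (the counts below are made up), reversing the columns only changes the presentation, not the counts:

```r
# Made-up confusion table: column order is display-only.
tab <- table(observed = c(0, 0, 0, 1, 1), predicted = c(0, 1, 0, 1, 1))
tab          # columns in 0, 1 order
tab[, 2:1]   # same table, columns in 1, 0 order
```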

Correlation of categorical data to binomial response in R

I'm looking to analyze the correlation between a categorical input variable and a binomial response variable, but I'm not sure how to organize my data or if I'm planning the right analysis.
Here's my data table (variables explained below):
species<-c("Aaeg","Mcin","Ctri","Crip","Calb","Tole","Cfus","Mdes","Hill","Cpat","Mabd","Edim","Tdal","Tmin","Edia","Asus","Ltri","Gmor","Sbul","Cvic","Egra","Pvar")
scavenge<-c(1,1,0,1,1,1,1,0,1,0,1,1,1,0,0,1,0,0,0,0,1,1)
dung<-c(0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0)
pred<-c(0,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0)
nectar<-c(1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0)
plant<-c(0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0)
blood<-c(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0)
mushroom<-c(0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0)
loss<-c(0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0) #1 means yes, 0 means no
data<-data.frame(species,scavenge,dung,pred,nectar,plant,blood,mushroom,loss) # data.frame (not cbind) keeps the 0/1 columns numeric
data #check data table
data table explanation
I have individual species listed, and the next columns are their annotated feeding types. A 1 in a given column means yes, and a 0 means no. Some species have multiple feeding types, while some have only one feeding type. The response variable I am interested in is "loss," indicating loss of a trait. I'm curious to know if any of the feeding types predict or are correlated with the status of "loss."
thoughts
I wasn't sure if there was a good way to include feeding types as one categorical variable with multiple categories. I don't think I can organize it as a single variable with the types c("scavenge","dung","pred", etc...) since some species have multiple feeding types, so I split them up into separate columns and indicated their status as 1 (yes) or 0 (no). At the moment I was thinking of trying to use a log-linear analysis, but examples I find don't quite have comparable data... and I'm happy for suggestions.
Any help or pointing in the right direction is much appreciated!
There are too few samples: you have 4 with loss == 1 and 18 with loss == 0. You will run into problems fitting a full logistic regression (i.e. including all variables). I suggest testing for association between loss and each feeding habit using Fisher's exact test:
library(dplyr)
library(purrr)
library(tibble)  # for add_column()
# function for the fisher test
FISHER <- function(x, y){
  FT = fisher.test(table(x, y))
  data.frame(
    pvalue = FT$p.value,
    oddsratio = as.numeric(FT$estimate),
    lower_limit_OR = FT$conf.int[1],
    upper_limit_OR = FT$conf.int[2]
  )
}
# define variables to test
FEEDING <- c("scavenge","dung","pred","nectar","plant","blood","mushroom")
# we loop through and test association between each variable and "loss"
results <- data[, FEEDING] %>%
  map_dfr(FISHER, y = data$loss) %>%
  add_column(var = FEEDING, .before = 1)
You get the results for each feeding habit:
> results
var pvalue oddsratio lower_limit_OR upper_limit_OR
1 scavenge 0.264251538 0.1817465 0.002943469 2.817560
2 dung 1.000000000 1.1582683 0.017827686 20.132849
3 pred 0.263157895 0.0000000 0.000000000 3.189217
4 nectar 0.535201640 0.0000000 0.000000000 5.503659
5 plant 0.002597403 Inf 2.780171314 Inf
6 blood 1.000000000 0.0000000 0.000000000 26.102285
7 mushroom 0.337662338 5.0498688 0.054241930 467.892765
The pvalue column is the p-value from fisher.test; broadly, an odds ratio > 1 means the variable is positively associated with loss. Of all your variables, plant shows the strongest association, which you can check:
> table(loss,plant)
plant
loss 0 1
0 18 0
1 1 3
Almost all species with plant = 1 also have loss = 1. So with your current dataset, I think this is the best you can do. You should collect a larger sample to see whether this still holds.

Calculate AUC for test set (keras model in R)

Is there a way (function) to calculate the AUC value for a keras model in R on the test set?
I have searched on Google but nothing showed up.
From a Keras model, we can extract the predicted values as either classes or probabilities, as follows:
Probabilities:
[1,] 9.913518e-01 1.087829e-02
[2,] 9.990101e-01 1.216531e-03
[3,] 9.445553e-01 6.256607e-02
[4,] 9.928864e-01 6.808311e-03
[5,] 9.993126e-01 1.028240e-03
[6,] 6.075442e-01 3.926141e-01
Class:
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
Many thanks,
Ho
Generally it does not really matter which classifier (keras or not) did the prediction. All you need to estimate the AUC are two things: the predicted probabilities from some classifier and the actual category (for example, dead "yes" vs. "no"). With these you can calculate both the true positive rate and the false positive rate, so you can also make a ROC plot and estimate the AUC. You can use
library(pROC)
roc_obj <- roc(category, prediction)
auc(roc_obj)
See here for some more explanation.
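As a self-contained illustration (all data simulated, names made up), the same two inputs are all pROC needs regardless of which model produced the predictions:

```r
# Simulated sketch: AUC from true 0/1 labels and predicted probabilities.
library(pROC)

set.seed(1)
category <- rbinom(100, 1, 0.5)        # actual 0/1 outcomes
prediction <- ifelse(category == 1,    # made-up predicted probabilities,
                     rbeta(100, 4, 2), # higher on average for the 1s
                     rbeta(100, 2, 4))

roc_obj <- roc(category, prediction)
auc(roc_obj)
```

For an informative classifier like this simulated one, the AUC comes out well above 0.5.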
I'm not sure this will answer your needs, as it depends on your data structure and keras output format, but have a look at the dismo package's evaluate function. You need to set up something like this:
library(dismo)
predictors <- ...  # stack of explanatory variables
pres_test <- ...   # subset of presence data held out from model fitting, for testing
backg_test <- ...  # true or random (background) absence data
model <- ...       # output of your model
AUC <- evaluate(pres_test, backg_test, model, predictors)
# you may bootstrap this step by randomly re-selecting 'pres_test' and 'backg_test' each time