How to loop glm inside an R user-defined function?

Context: I am trying to predict survival using the Titanic disaster dataset from Kaggle.
The dataset train4 is used in the code below:
PassengerId  Pclass  Sex     Age  SibSp  Parch  Fare     Cabin  Sex_F  Survived
1            3       male    22   1      0      7.25            0      0
2            1       female  38   1      0      71.2833  C85    1      1
3            3       female  26   0      0      7.925           1      1
4            1       female  35   1      0      53.1     C123   1      1
5            3       male    35   0      0      8.05            0      0
7            1       male    54   0      0      51.8625  E46    0      0
I am trying to run glm n times on each of 3 models. I have wrapped these 3 models inside one function, run_models. The function is defined without problems, but when I call run_models(10) it doesn't produce any results, and I get this message: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == : prediction from a rank-deficient fit may be misleading.
However, it works if I run the for(i in 1:n) part directly, replacing n with 10. The code would be more flexible if I could automate it without manually changing "n" at each occurrence.
Here's the complete code:
library("dplyr")
library("ggplot2")
library("scales")
train<-read.csv("train.csv")
test<-read.csv("test.csv")
attach(train)
#Filtering relevant informations
train2<-train[complete.cases(train),]
train3<-train2[,-c(4,9)]
train3<-train3[,c(1,3:9,2)]
#Dummy variables for Sex
train4<-mutate(train3,Sex_F = (train3$Sex == "female")*(train3$Sex == "female"))
#Massaging final dataset
train4<-train4[,c(1:8,10,9)]
#Fitting logistic regression
fit1<-glm(Survived~., data = train4, family = binomial(link='logit'))
fit2<-glm(Survived~ Pclass + Age + SibSp + Parch + Fare + Sex_F, data = train4, family = binomial(link='logit'))
fit3<-glm(Survived~ Pclass + Sex_F, data = train4, family = binomial(link='logit'))
models_summary<-data.frame(Model = numeric(),Accuracy = numeric(),Deviance = numeric())
run_models<-function(n){
  for(i in 1:n)
  {
    #Making some dataset to check accuracy
    #Model1
    check1<-sample_n(train4,100)
    check_fit1<-data.frame(round(predict(fit1, newdata = data.frame(check1[,-c(10)]),type="response"),0))
    colnames(check_fit1)<-c("Survival_predicted")
    final1<-cbind(check1[,c(1,10)],check_fit1[,1])
    colnames(final1)<-c("Passenger ID","Survived","Survival_predicted")
    accuracy1<-1 - nrow(final1[which(final1$Survived!=final1$Survival_predicted),])/nrow(final1)
    #Model2
    check2<-sample_n(train4[,-c(3,8)],100)
    check_fit2<-data.frame(round(predict(fit2, newdata = data.frame(check2[,-c(8)]),type="response"),0))
    colnames(check_fit2)<-c("Survival_predicted")
    final2<-cbind(check2[,c(1,8)],check_fit2[,1])
    colnames(final2)<-c("Passenger ID","Survived","Survival_predicted")
    accuracy2<-1 - nrow(final2[which(final2$Survived!=final2$Survival_predicted),])/nrow(final2)
    #Model3
    check3<-sample_n(train4[,c(1,2,9,10)],100)
    check_fit3<-data.frame(round(predict(fit3, newdata = data.frame(check3[,c(1:3)]),type="response"),0))
    colnames(check_fit3)<-c("Survival_predicted")
    final3<-cbind(check3[,c(1,4)],check_fit3[,1])
    colnames(final3)<-c("Passenger ID","Survived","Survival_predicted")
    accuracy3<-1 - nrow(final3[which(final3$Survived!=final3$Survival_predicted),])/nrow(final3)
    #Summary
    models_summary_TEMP<-data.frame(c("1","2","3"),c(accuracy1,accuracy2,accuracy3),c(fit1$deviance,fit2$deviance,fit3$deviance))
    colnames(models_summary_TEMP)<-c("Model","Accuracy","Deviance")
    models_summary<-rbind(models_summary,models_summary_TEMP)
    models_summary_TEMP<-data.frame(Model = numeric(),Accuracy = numeric(),Deviance = numeric())
  }
}
run_models(10)
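One thing that may explain why nothing appears after run_models(10): assignments made inside a function stay local to that function, so the global models_summary is never updated, and a for loop on its own returns NULL invisibly. The following is a minimal sketch of that pattern, not the asker's full code. It assumes the built-in mtcars data as a stand-in for train4 and uses a hypothetical helper name, run_accuracy, so that it is self-contained.

```r
# Sketch only: mtcars and run_accuracy() are stand-ins, not part of the question.
fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

run_accuracy <- function(n, size = 20) {
  # Collect one row per iteration in a list that is local to the function
  rows <- vector("list", n)
  for (i in seq_len(n)) {
    check <- mtcars[sample(nrow(mtcars), size), ]
    predicted <- round(predict(fit, newdata = check, type = "response"))
    rows[[i]] <- data.frame(Run = i,
                            Accuracy = mean(predicted == check$am),
                            Deviance = fit$deviance)
  }
  # Return the combined result; without this line the function returns NULL
  do.call(rbind, rows)
}

set.seed(1)
models_summary <- run_accuracy(10)
print(models_summary)
```

The caller assigns the returned data frame, so n only has to be supplied once, in the call.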

Related

How can I solve the issue that VIF doesn't work after added more variables?

I created ordered logit models using polr in RStudio and then used vif on them. It worked fine and I got my results.
Then I decided to add more variables, such as dummies for age groups, education and income groups. When I now use vif, I get an error about a subscript being out of bounds.
Here is an example of my data set:
Work less  lifesatisfied  country  Work much
0          8              GB       1
1          8              SE       0
0          9              DK       1
0          9              DE       1
NA         5              NO       NA
continued:
health  education  income  age  marital status
3       3          NA      61   NA
4       2          2       30   NA
1       3          4       39   6
5       7          5       52   4
4       1          5       17   5
gender is dummy 1 or 2
age is respondents age like 35, 47 etc.
income is scaled and is 1 to 10
educ (education) is 1 to 7
health is scaled 1 to 5
work less is dummy i.e. 1 or 0
work much is dummy, i.e. 1 or 0
marital status is scaled 1 to 6
lifesatisfied is the dependent variable and is scaled 0 to 10.
My ordered regression model:
myorderedmodel = polr(factor(lifesatisfied, ordered = TRUE) ~ maritalstatus + gender + age + age20_29 + age30_39 + age40_49 + age50_59 + income + lowincome + avgincome + highincome + noeducation + loweducation + higheducation + health + child + religion + workless + workmuch, data = mydata, method = "logistic", Hess = TRUE)
vif(myorderedmodel)
Gives the following error:
Error in R[subs, subs] : subscript out of bounds
I really didn't understand the error. What does it mean? And how can it be solved?
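The error message alone does not say which variable is responsible. One thing that may be worth checking (an assumption, since the full data are not shown) is whether the added dummies are perfectly collinear, for example a complete set of group dummies alongside an intercept. A minimal sketch with made-up data, reusing three of the column names from the model above:

```r
# Made-up data: three income-group dummies that always sum to 1
set.seed(42)
grp <- sample(c("low", "avg", "high"), 50, replace = TRUE)
toy <- data.frame(
  lowincome  = as.integer(grp == "low"),
  avgincome  = as.integer(grp == "avg"),
  highincome = as.integer(grp == "high"),
  health     = sample(1:5, 50, replace = TRUE)
)

# Design matrix with an intercept, as a regression would build it
X <- model.matrix(~ lowincome + avgincome + highincome + health, data = toy)

# If the rank is smaller than the number of columns, some columns are
# exact linear combinations of others
rank_X <- qr(X)$rank
cat("columns:", ncol(X), " rank:", rank_X, "\n")
```

If the rank is lower than the column count, leaving one dummy per group out of the formula (the reference category) is the usual remedy.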

R Error in model.frame.default(formula, data) : variable lengths differ (found for 'age')

Fitting the SVM model on my training data works, but when I try to tune the hyperparameters I get the error shown below.
I am new to ML and R, so any help is appreciated. Thank you so much.
#svm model
svm_model= svm(svm_train$y ~.,data = svm_train,kernel = "radial",cost = 1,gamma = 1/ncol(svm_train),type="C-classification")
str(svm_model)
summary(svm_model)
#
#Parameters:
# SVM-Type: C-classification
#SVM-Kernel: radial
#cost: 1
#Number of Support Vectors: 1844
#( 889 955 )
#Number of Classes: 2
#Levels:
# 1 2
#tune
tObj<-tune.svm(svm_train$y ~.,data = svm_train,gamma = c(0.1,0.5,1,5,2,3,4,5),cost = c(0.5,1,5,10,100,1000),type="C-classification",kernel = "radial")
Error in model.frame.default(formula, data) :
variable lengths differ (found for 'age')
My svm_train is a data frame with no missing values, as shown below:
> svm_train[1,]
age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx
24020 40 10 1 4 2 1 2 1 10 2 144 1 999 1 1 -0.1 93.798 -40.4
euribor3m nr.employed y
24020 4.968 5195.8 1
>
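A likely cause (an assumption, since it was not confirmed in the thread) is the svm_train$y on the left-hand side of the formula. tune.svm refits the model on subsets of data during cross-validation, while svm_train$y always refers to the full column, so the response and the predictors end up with different lengths. The sketch below reproduces the same kind of error with base R only; mtcars stands in for svm_train.

```r
# Sketch with built-in data; mtcars stands in for svm_train
subset_rows <- mtcars[1:10, ]

# A response written as mtcars$mpg always has 32 values, whatever `data` is
bad <- tryCatch(
  model.frame(mtcars$mpg ~ ., data = subset_rows),
  error = function(e) conditionMessage(e)
)
print(bad)

# A response written as a bare column name is looked up inside `data`
good <- model.frame(mpg ~ ., data = subset_rows)
print(dim(good))
```

Under that assumption, writing the formula as y ~ . (with data = svm_train) in both the svm and tune.svm calls would be the thing to try.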

How to convert the values of a column of a data frame to 0s and 1s?

I am working on a negative binomial regression model, which predicts the number of initiated private member bills by MPs in Japan based on their voteshare, age, sex and parliamentary office. In order to calculate the AME for the parliamentary_office variable I need to create two new data frames df.0 and df.1. As an example, here's the data frame df.0:
(Intercept) voteshare age sexmale term parliamentary_office
1 1 37.92 57 male 0 0
2 1 45.99 65 male 5 0
3 1 36.18 59 female 3 0
4 1 43.3 47 male 1 0
5 1 45.48 58 male 5 0
6 1 31.89 44 male 0 0
How can I convert the sexmale column to numbers?
Here is my code:
#rm(list=ls())
library(foreign)
dat <- read.dta(file = 'activity.dta', convert.factors = FALSE)
dat_clear <- na.omit(dat)
datc_2012 <- dat_clear[dat_clear$election == 2012, ]
library(MASS)
summary(m2.negbin <- glm.nb(num_pmbs_initiated ~
voteshare + age + sex + term
+ parliamentary_office, data = datc_2012,
link = "log"))
df.0 <- data.frame(cbind(1,
m2.negbin[["model"]]$voteshare,
m2.negbin[["model"]]$age,
m2.negbin[["model"]]$sex,
m2.negbin[["model"]]$term,
m2.negbin[["model"]]$parliamentary_office))
colnames(df.0) <- names(coef(m2.negbin))
df.1 <- df.0
df.0[,"parliamentary_office"] <- 0
df.1[,"parliamentary_office"] <- 1
You can use the ifelse function if the column contains only "male" and "female" and you need to change it into 0-1 numbers. ifelse is vectorized, so sapply is not needed. Since the column is named sexmale, male is coded as 1:
df.0$sexmale <- ifelse(df.0$sexmale == "male", 1, 0)
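An alternative that avoids manual recoding is to take the design matrix from the fitted model with model.matrix(), which is already numeric and already contains the 0/1 dummy column. The sketch below is an illustration under stated assumptions: the data values are made up, the response is called num_pmbs, and a Poisson glm stands in for glm.nb so that only base R is needed.

```r
# Toy data with made-up values; a Poisson glm stands in for glm.nb
toy <- data.frame(
  num_pmbs = c(0, 2, 1, 3, 0, 1, 4, 2),
  voteshare = c(37.9, 46.0, 36.2, 43.3, 45.5, 31.9, 50.1, 40.2),
  sex = c("male", "male", "female", "male", "male", "male", "female", "female"),
  parliamentary_office = c(0, 1, 0, 0, 1, 0, 1, 1)
)
m <- glm(num_pmbs ~ voteshare + sex + parliamentary_office,
         data = toy, family = poisson)

# model.matrix() returns the numeric design matrix, with sex already
# expanded into a 0/1 column named "sexmale" (female is the reference level)
df.0 <- as.data.frame(model.matrix(m))
df.1 <- df.0
df.0[, "parliamentary_office"] <- 0
df.1[, "parliamentary_office"] <- 1
str(df.0)
```

Because the columns come straight from the model, their names and order match names(coef(m)) without a separate colnames() step.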

Average Marginal Effects in R with complex interaction terms

I am using R to compute the linear regression on the following model, as well as find the marginal effects of age on pizza at specific points (20,30,40,50,55).
mod6.22c <- lm(pizza ~ age + income + age*income +
I((age*age)*income), data = piz4)
The problem I am running into is that, when using the margins command, R does not see interaction terms that are inserted into the lm with I((age x age) x income). The margins command will only produce accurate average marginal effects when the interaction terms are in the form of variable1 x variable1. I also can't create a new variable in my table, such as table$newvariable <- table$variable1^2, because the margins command won't identify newvariable as related to variable1.
This has been fine up until now, where my interaction terms have only been a quadratic, or an xy interaction, but now I am at a point where I need to calculate the average marginal effects with the interaction term AGE^2xINCOME included in the model, but the only way I can seem to get the summary lm output to be correct is by using I(age^2*(income)) or by creating a new variable in my table. As stated before, the margins command can't read I(age^2*(income)), and if I create a new variable, the margins command doesn't recognize the variables are related, and the average marginal effects produced are incorrect.
The error I am receiving:
> summary(margins(mod6.22c, at = list(age= c(20,30,40,50,55)),
variables = "income"))
Error in names(classes) <- clean_terms(names(classes)) :
'names' attribute [4] must be the same length as the vector [3]
I appreciate any help in advance.
Summary of data:
Pizza is annual expenditure on pizza, female, hs, college and grad are dummy variables, income is in thousands of dollars per year, age is years old.
> head(piz4)
pizza female hs college grad income age agesq
1 109 1 0 0 0 19.5 25 625
2 0 1 0 0 0 39.0 45 2025
3 0 1 0 0 0 15.6 20 400
4 108 1 0 0 0 26.0 28 784
5 220 1 1 0 0 19.5 25 625
6 189 1 1 0 0 39.0 35 1225
Libraries used:
library(data.table)
library(dplyr)
library(margins)
TL;DR
This works:
mod6.22 <- lm(pizza ~ age + income + age*income, data = piz4)
summary(margins(mod6.22, at = list(age= c(20,30,40,50,55)), variables = "income"))
factor age AME SE z p lower upper
income 20.0000 4.5151 1.5204 2.9697 0.0030 1.5352 7.4950
income 30.0000 3.2827 0.9049 3.6276 0.0003 1.5091 5.0563
income 40.0000 2.0503 0.4651 4.4087 0.0000 1.1388 2.9618
income 50.0000 0.8179 0.7100 1.1520 0.2493 -0.5736 2.2095
income 55.0000 0.2017 0.9909 0.2036 0.8387 -1.7403 2.1438
This doesn't work:
mod6.22c <- lm(pizza ~ age + income + age*income + I((age * age)*income), data = piz4)
summary(margins(mod6.22c, at = list(age= c(20,30,40,50,55)), variables = "income"))
Error in names(classes) <- clean_terms(names(classes)) :
'names' attribute [4] must be the same length as the vector [3]
How do I get margins to read my interaction variable I((age*age)*income)?
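One way to cross-check the numbers without relying on margins is to compute the effect by hand. The model is linear in income, so the marginal effect of income at a given age is the derivative b_income + b_age:income * age + b_I(age^2 * income) * age^2, and its standard error follows from the delta method. The sketch below is not a fix for margins itself; it assumes simulated data in place of piz4 and writes the extra term as I(age^2 * income).

```r
# Simulated stand-in for piz4 (values are made up)
set.seed(123)
sim <- data.frame(age = sample(18:60, 200, replace = TRUE),
                  income = runif(200, 10, 100))
sim$pizza <- 50 + 2 * sim$age + 5 * sim$income -
  0.08 * sim$age * sim$income + rnorm(200, sd = 20)

mod <- lm(pizza ~ age + income + age:income + I(age^2 * income), data = sim)
b <- coef(mod)

# d(pizza)/d(income) = b_income + b_age:income * age + b_I(...) * age^2
ages <- c(20, 30, 40, 50, 55)
me_income <- b["income"] + b["age:income"] * ages +
  b["I(age^2 * income)"] * ages^2

# Delta-method standard errors: one gradient row per age value
G <- matrix(0, nrow = length(ages), ncol = length(b),
            dimnames = list(NULL, names(b)))
G[, "income"] <- 1
G[, "age:income"] <- ages
G[, "I(age^2 * income)"] <- ages^2
se <- sqrt(diag(G %*% vcov(mod) %*% t(G)))

result <- data.frame(age = ages, ME = unname(me_income), SE = se)
print(result)
```

Coefficients are looked up by name rather than by position because lm places the I() term before the age:income interaction.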

How to plot variable's predicted probability based on glm model

I would like to plot each of the variables that are part of the glm model, where the y axis is the predicted probability and the x axis is the variable levels or values.
Here is my code that I tried in order to do it:
The data:
dat <- read.table(text = "target apcalc admit num
0 0 0 21
0 0 1 24
0 1 0 55
0 1 1 72
1 0 0 5
1 0 1 31
1 1 0 11
1 1 1 3", header = TRUE)
The glm model:
f<-glm(target ~ apcalc + admit +num, data = dat,family=binomial(link='logit'))
The loop to present the desired plot:
for(i in 1:length(f$var.names)){
plot(predict(f,i.var.names=i,newdata=dat,type='response'))
}
I got a strange plot as output, with "Index" on the x axis and "predict(f,i.var.names=i,newdata=dat,type='response')" on the y axis. How can I fix my code to get the desired result?
(I don't have the reputation yet to post the plot here.)
Here is how to plot all your variables against the predicted probability:
f<-glm(target ~ apcalc + admit +num, data=dat,family=binomial(link="logit"))
PredProb=predict(f,type='response') #predicting probabilities
par(mfrow=c(2,2))
for(i in names(dat)){
plot(dat[,i],PredProb,xlab=i)
}
After running the f<-glm(.....) part, f$var.names gives NULL as output, so the problem starts there.
f<-glm(target ~ apcalc + admit +num, data=dat,family=binomial("logit"))
f
Call: glm(formula = target ~ apcalc + admit + num, family = binomial("logit"),
data = dat)
Coefficients:
(Intercept) apcalc admit num
2.2690 3.1742 2.4406 -0.1721
Degrees of Freedom: 7 Total (i.e. Null); 4 Residual
Null Deviance: 11.09
Residual Deviance: 5.172 AIC: 13.17
f$var.names
NULL
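A fitted glm object has no var.names element, which is why f$var.names is NULL. The predictor names can be read from the model terms instead. The sketch below reuses the data from the question; the variable names predictors and pred_prob are introduced here for illustration.

```r
dat <- read.table(text = "target apcalc admit num
0 0 0 21
0 0 1 24
0 1 0 55
0 1 1 72
1 0 0 5
1 0 1 31
1 1 0 11
1 1 1 3", header = TRUE)

f <- glm(target ~ apcalc + admit + num, data = dat,
         family = binomial(link = "logit"))

# Names of the right-hand-side terms of the model formula
predictors <- attr(terms(f), "term.labels")
print(predictors)

# One panel per predictor: predictor values against predicted probability
pred_prob <- predict(f, type = "response")
par(mfrow = c(2, 2))
for (v in predictors) {
  plot(dat[[v]], pred_prob, xlab = v, ylab = "Predicted probability")
}
```

Looping over the term labels rather than names(dat) leaves the response, target, out of the panels.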
