I would like to add a column with the X² (Chi-Square) statistic as well as a column with Exp(B) to my regression output. Does anyone have an idea how to do this? Many thanks in advance. For now I have calculated these manually for every model and variable, which is quite time-consuming.
model_simple <- as.formula("completion_yesno ~ ac + ov + UCRate + FirstWeek + LastWeek + DayofWeekSu + DayofWeekMo + DayofWeekTu + DayofWeekWe + DayofWeekTh + DayofWeekFr + MonthofYearJan + MonthofYearFeb + MonthofYearMar + MonthofYearApr +MonthofYearMay+ MonthofYearJun + MonthofYearJul + MonthofYearAug + MonthofYearSep + MonthofYearOct + MonthofYearNov")
clog_simple1 = glm(model_simple,data=cllw,family = binomial(link = cloglog))
summary(clog_simple1)
Maybe you can elaborate on what you mean by chi-square and exp(B). In the meantime, you can do the following:
da <- MASS::Pima.tr
model <- glm(type ~ .,data=da,family = binomial(link = cloglog))
results <- data.frame(coefficients(summary(model)),check.names=FALSE)
# placeholder chi-square values (random), since it is unclear which statistic you mean
results$chisq = rchisq(nrow(results), 1)
results$expB = exp(results$Estimate)
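If by chi-square you mean the per-coefficient Wald statistic that SPSS prints next to Exp(B), you don't need placeholders: it is just the squared z value from the summary table. A minimal sketch, relying on the column names that coefficients(summary(model)) produces with check.names=FALSE:
# Wald chi-square per coefficient: (Estimate / Std. Error)^2, i.e. the squared z value
results$chisq <- results[["z value"]]^2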
Or you can use tidy from broom:
library(broom)
results <- tidy(model)
results$expB <- exp(results$estimate)  # note: tidy() names the column "estimate", not "Estimate"
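tidy() can also do the exponentiation for you via its exponentiate argument (for a cloglog link the result is not an odds ratio, and broom will warn accordingly, so interpret with care):
results <- tidy(model, exponentiate = TRUE, conf.int = TRUE)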
Related: how can I replicate the following Stata regression in R?
reg laccidentsvso2 weakban strongban lpop lunemp permale2 lrgastax laccidentmv2 st1-st50 t1-t48 time stt1-stt50 [aweight=pop], cluster(state)
There is no direct equivalent of Stata's regress command in R, but the following should reproduce the point estimates and the clustered standard errors. Two things do not carry over literally: in an R formula, st1-st50 means "st1 minus st50", not a variable range, so the dummies must be listed explicitly; and Stata's cluster(state) adjusts the standard errors rather than the fit, which in R is done after estimation:
library(lmtest)
library(sandwich)
# build the right-hand side explicitly; R formulas do not expand st1-st50 as a range
rhs <- c("weakban", "strongban", "lpop", "lunemp", "permale2", "lrgastax", "laccidentmv2",
         paste0("st", 1:50), paste0("t", 1:48), "time", paste0("stt", 1:50))
model1 <- lm(reformulate(rhs, response = "laccidentsvso2"), data = rstata, weights = pop)
# cluster-robust standard errors by state, as in Stata's cluster(state)
coeftest(model1, vcov = vcovCL, cluster = ~ state)
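Alternatively, if the raw state and year identifiers exist in rstata rather than only the pre-built dummies (an assumption on my part), the fixest package can absorb the fixed effects and cluster in one call; the state-specific trends stt1-stt50 would still need to be added separately:
library(fixest)
# hypothetical sketch: assumes rstata has raw `state` and `year` columns
feols(laccidentsvso2 ~ weakban + strongban + lpop + lunemp + permale2 +
        lrgastax + laccidentmv2 | state + year,
      data = rstata, weights = ~pop, cluster = ~state)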
I am using R on panel data to produce a regression table. One of my control variables (SPINDEX) is not showing up in the table.
Does anyone know how to fix this?
# Descriptives and tables
#descriptive 1
library(dplyr)
library(stargazer)
dfOnlyInteresting = select(df, "amountAquired", "NumberDirectors", "AnnualReportDate", "GenderRatio", "AGE", "roa", "xrd", "SPINDEX", "aquisitionFin", "aqusitionsNot0", "CEOTenure", "compRatio", "laggedAquisition", "amountAquired_mean")
stargazer(dfOnlyInteresting, type = "html", title = "Descriptive statistics", digits = 1, out = "descriptives.doc")
# Create a panel dataframe
library(plm)
df.p = pdata.frame(df, index = c("GVKEY", "AnnualReportDate"))
# Test for duplicate row names
occur = data.frame(table(row.names(df.p)))
duplicateRowNames = occur[occur$Freq > 1,]
# Table 2
#----------------------------------------------------------
# Define models
#----------------------------------------------------------
mdlA <- amountAquired ~ GenderRatio + CEOTenure + AGE + roa + factor(SPINDEX) + aquisitionFin + xrd + laggedAquisition
mdlB <- amountAquired ~ GenderRatio + CEOTenure + AGE + roa + factor(SPINDEX) + aquisitionFin + xrd + laggedAquisition + compRatio
mdlC <- amountAquired ~ GenderRatio + CEOTenure + AGE + roa + factor(SPINDEX) + aquisitionFin + xrd + laggedAquisition + compRatio*NumberDirectors
Thanks!
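The thread does not include an accepted fix, but two common causes are worth ruling out first (these are guesses, not a confirmed diagnosis): SPINDEX losing its variation to NA handling, and factor(SPINDEX) being perfectly collinear with other terms, in which case the fitting function silently drops it:
# rows with NA in any model variable are dropped before fitting
sum(is.na(df$SPINDEX))
# a factor with only one remaining level cannot be estimated
length(unique(na.omit(df$SPINDEX)))
# alias() reports terms that are perfectly collinear with the rest
alias(lm(mdlA, data = df))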
How can I find out how many observations were used in a regression?
model_simple <- as.formula("completion_yesno ~ ac + ov + UCRate + FirstWeek + LastWeek + DayofWeekSu + DayofWeekMo + DayofWeekTu + DayofWeekWe + DayofWeekTh + DayofWeekFr + MonthofYearJan + MonthofYearFeb + MonthofYearMar + MonthofYearApr +MonthofYearMay+ MonthofYearJun + MonthofYearJul + MonthofYearAug + MonthofYearSep + MonthofYearOct + MonthofYearNov")
clog_simple1 = glm(model_simple,data=cllw,family = binomial(link = cloglog))
summary(clog_simple1)
I have tried the fitted() command, but it did not give me a concrete number of observations N.
Use the built-in nobs() function:
nobs(clog_simple1)
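If you also want to know how many rows were lost (glm() silently drops rows with missing values under the default na.action), compare against the original data frame:
# rows in the data minus rows actually used in the fit
nrow(cllw) - nobs(clog_simple1)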
I am working on a data set and would like to do stepwise logistic regression using some variables, for which I am using the add1() function in R. A sample of the data set can be downloaded here: https://drive.google.com/file/d/0B0N-Nc7kEi4bVjhDd1FDaEE5cEE/view?usp=sharing
I thereby fit a logistic regression using:
train <- read.csv('training.csv')
glm.model_step_1 <- glm(loan_status ~ acc_open_past_24mths + annual_inc + avg_cur_bal + bc_open_to_buy + delinq_2yrs + dti + inq_last_6mths + installment + int_rate + mo_sin_old_il_acct + mo_sin_old_rev_tl_op + mo_sin_rcnt_rev_tl_op + mo_sin_rcnt_tl + mort_acc + mths_since_last_delinq + mths_since_recent_bc + mths_since_recent_inq + num_accts_ever_120_pd + num_actv_bc_tl + num_actv_rev_tl + num_bc_tl + num_il_tl + num_op_rev_tl + num_tl_op_past_12m + pct_tl_nvr_dlq + percent_bc_gt_75 + pub_rec_bankruptcies + revol_bal + revol_util + term + total_acc + total_bc_limit + total_il_high_credit_limit + fico_mean + addr_state + emp_length + verification_status + Count_NA + Info_missing + Engineer + Teacher + Doctor + Professor + Manager + Director + Analyst + senior + lead + consultant + home_ownership_own + home_ownership_rent + purpose_debt_consolidation + purpose_medical + purpose_credit_card + purpose_other,
data = train,
family = binomial(link = 'logit'))
I then use the add1() function to do forward selection:
add1(glm.model_step_1, scope = train)
This code does not work; I get the following error:
Error in factor.scope(attr(terms1, "factors"), list(add = attr(terms2, :
upper scope has term ‘NA’ not included in model
Does anyone know how to solve this error?
A question asked previously on datascience.stackexchange (https://datascience.stackexchange.com/questions/11604/checking-regression-coefficients-stability) mentioned checking for NAs. There are no NAs in the data set, which can be confirmed by running sapply(train, function(x) sum(is.na(x))).
The train data set of @Jash Sash contains some anomalous values that force read.csv() to read several numeric variables as factors with many levels.
Anyway, I consider here a model with only a few variables in order to show how to avoid the error message reported above.
Remember that the scope argument must be a "formula giving the terms to be considered for adding or dropping"; it cannot be a data.frame as in the code of @Jash Sash.
train <- read.csv('training.csv')
# check which columns were read in as factors (apply() coerces to a matrix, so use sapply())
factor_cols <- sapply(train, is.factor)
glm.model_step_1 <- glm(loan_status ~ acc_open_past_24mths + avg_cur_bal + bc_open_to_buy,
data = na.omit(train),
family = binomial(link = 'logit'))
add1(glm.model_step_1, scope=~.+delinq_2yrs+inq_last_6mths+int_rate)
The result is:
Model:
loan_status ~ acc_open_past_24mths + avg_cur_bal + bc_open_to_buy
Df Deviance AIC
<none> 1038.6 1046.6
delinq_2yrs 1 1037.9 1047.9
inq_last_6mths 1 1038.0 1048.0
int_rate 1 1038.0 1048.0
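If the goal is a full forward-selection loop rather than a single add1() pass, step() accepts the same kind of scope formula; a sketch using the same handful of candidate variables:
# forward selection over the same candidate terms
step(glm.model_step_1,
     scope = ~ . + delinq_2yrs + inq_last_6mths + int_rate,
     direction = "forward")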
I am trying to plot the model predictions from a binary-choice glm against the empirical probability, using data from the Titanic. To show differences across class and sex I am using faceting, but there are two things I can't quite figure out. First, I'd like to restrict the loess curve to lie between 0 and 1, but if I add ylim(c(0,1)) to the plot, the ribbon around the loess curve gets cut off wherever it crosses the bound. Second, I'd like to draw a line in each facet from the minimum x-value (the predicted probability from the glm) to the maximum x-value within the same facet and y = 1, so as to show the glm predicted probability.
#info on this data http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3info.txt
load(url('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.sav'))
titanic <- titanic3[ ,-c(3,8:14)]; rm(titanic3)
titanic <- na.omit(titanic) #probably missing completely at random
titanic$age <- as.numeric(titanic$age)
titanic$sibsp <- as.integer(titanic$sibsp)
titanic$survived <- as.integer(titanic$survived)
training.df <- titanic[sample(nrow(titanic), nrow(titanic) / 2), ]
validation.df <- titanic[!(row.names(titanic) %in% row.names(training.df)), ]
glm.fit <- glm(survived ~ sex + sibsp + age + I(age^2) + factor(pclass) + sibsp:sex,
family = binomial(link = "probit"), data = training.df)
glm.predict <- predict(glm.fit, newdata = validation.df, se.fit = TRUE, type = "response")
plot.data <- data.frame(mean = glm.predict$fit, response = validation.df$survived,
class = validation.df$pclass, sex = validation.df$sex)
require(ggplot2)
ggplot(data = plot.data, aes(x = as.numeric(mean), y = as.integer(response))) + geom_point() +
stat_smooth(method = "loess", formula = y ~ x) +
facet_wrap( ~ class + sex, scale = "free") + ylim(c(0,1)) +
xlab("Predicted Probability of Survival") + ylab("Empirical Survival Rate")
The answer to your first question is to use coord_cartesian(ylim=c(0,1)) instead of ylim(0,1); this is a fairly frequently asked question.
For your second question, there may be a way to do it within ggplot but it was easier for me to summarize the data externally:
g0 <- ggplot(data = plot.data, aes(x = mean, y = response)) + geom_point() +
stat_smooth(method = "loess") +
facet_wrap( ~ class + sex, scale = "free") +
coord_cartesian(ylim=c(0,1))+
labs(x="Predicted Probability of Survival",
y="Empirical Survival Rate")
(I shortened your code slightly by eliminating some default values and using labs.)
library(plyr)
ss <- ddply(plot.data, c("class", "sex"), summarise, minx = min(mean), maxx = max(mean))
g0 + geom_segment(data=ss,aes(x=minx,y=minx,xend=maxx,yend=maxx),
colour="red",alpha=0.5)
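For what it's worth, the same per-facet summary can be written with dplyr instead of plyr (be careful if both packages are loaded, since their summarise functions mask each other):
library(dplyr)
ss <- plot.data %>%
  group_by(class, sex) %>%
  summarise(minx = min(mean), maxx = max(mean))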