Predicted Probability Calculations for Large Dataframe Following Regression - r

I have conducted a logistic regression with a binary dependent variable and 5 independent variables. The data come from a survey asking whether a person voted for or against a policy change (the binary dependent variable); the other variables are questions about income, location and other personal information that could inform how they voted.
Having conducted the regression, I'd now like to calculate the predicted probability that each person voted yes or no, to see how informative those variables are. My dataframe has information on 3000 people, and I'd like a predicted probability of voting for/against for every single row/person.
What methods are available for doing so?
Appreciate the help!

You can use the predict() function to calculate the predicted probabilities:
predict(model, newdata, type = "response")
Here model is your fitted logistic regression (the object returned by glm()) and newdata is a data frame containing all the variables used in the model, with one row for each individual you want a probability for. With type = "response" the predictions are returned on the probability scale.
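For concreteness, here is a minimal sketch; the data frame name survey, the outcome voted_yes, and the predictor names are hypothetical stand-ins for your own:

# Fit the logistic regression on the 0/1 outcome
model <- glm(voted_yes ~ income + location + age + education + employment,
             data = survey, family = binomial)
# Predicted probability of a "yes" vote for all 3000 rows/people
survey$prob_yes <- predict(model, newdata = survey, type = "response")
survey$prob_no  <- 1 - survey$prob_yes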

Related

Logistic regression without any outcome data

I am trying to perform logistic regression on data that contains a binary outcome. However, I do not have access to the outcome data.
I've calculated the probability of a "1" outcome for each subject by assigning "risk points" to certain values of each variable and summing them, so that the probability of a "1" is (sum of the subject's risk points) / (total number of possible risk points). I then took the log of the odds to get the logit, so I have a logit value between -3 and 2 for each subject.
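In R, the computation described above might look like this (object names are hypothetical):

p     <- risk_points / total_possible_points   # probability of a "1"
logit <- log(p / (1 - p))                      # log odds; here between -3 and 2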
However, I would like to use logistic regression to evaluate which variables have the greatest effect on the outcome probabilities. Is there a way in R to perform logistic regression using only the predictor variables and the logit, without the binary outcome data? I have tried glm() and it does not work, because logistic regression requires binary outcome data.
Thank you!

Firth's Penalised Logistic Regression - high chi-squared values

I am analysing a household budget survey dataset (large sample, over 200,000 households) to test whether households that spend more on alcohol also spend more on other discretionary items such as restaurants and entertainment.
Given the large number of households reporting zero expenditure on each of these items, the errors in my linear regression model were non-normally distributed, so I used a logistic regression instead.
When I ran the logistic regression I came across quasi-complete separation. Based on an analysis of the literature, Firth's penalised logistic regression seemed the most appropriate:
library(logistf)
Regression <- logistf(restaurant_spender ~ alc_spender + income_quintiles + education_hh,
                      data = alcohol, weights = weight, firth = TRUE)  # firth = TRUE applies the penalty (also the default)
Where:
restaurant_spender is binary (=1 if they spend anything on restaurants and 0 otherwise)
alc_spender same as above but for alcohol
income_quintiles is a categorical variable separating households into one of five income quintiles
education_hh is a categorical variable indicating the highest level of education for the household head.
And to get the odds ratios:
exp(coef(Regression))
This produces the odds ratios I would expect, and my confidence intervals make sense. However, my chi-squared values are all infinite.
I have cross-tabulated all of my independent variables against my dependent variable and there are no cells with 0 (in fact, the counts are evenly distributed).
My questions are:
1) Am I doing anything obviously wrong in running a Firth’s penalised logistic regression in R?
2) Are infinite chi-squared values implausible?
3) Is there some other way in R to diagnose why I am getting quasi-separation, apart from cross-tabulating the independent and dependent variables? (See the sketch below.)
Any help would be greatly appreciated.
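Regarding question 3, the cross-tabulation mentioned above can be done with table(); empty or nearly empty cells in these tables are the classic symptom of quasi-complete separation (variable names as given in the question):

with(alcohol, table(restaurant_spender, alc_spender))       # look for empty cells
with(alcohol, table(restaurant_spender, income_quintiles))
with(alcohol, table(restaurant_spender, education_hh))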

how to get the coefficients for a count dependent variable

I have a question regarding regression analysis in R; it would be great if you could help. My dependent variable is a count: the number of citations received by three groups of papers (female-authored, male-authored, and mixed female-male-authored). I also have several covariates, which can be categorical or continuous; the variable of interest is the three-group authorship factor.
With the kind of data that I have (overdispersed, with an excess of zeros), a hurdle regression model seems to be the best choice. I can fit the model in R using three libraries: MASS, pscl and AER. The only problem is how to get the coefficients for all three groups of papers, so that I can compare them.
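A minimal sketch of such a hurdle model with pscl (the data frame and variable names are hypothetical, and the negative binomial count part is an assumption to handle the overdispersion):

library(pscl)
# citations: count outcome; author_group: three-level factor; covariates hypothetical
m <- hurdle(citations ~ author_group + covariate1 + covariate2,
            data = papers, dist = "negbin")
summary(m)   # separate coefficients for the count part and the zero hurdle part
coef(m)      # author_group coefficients are contrasts against the baseline group
# To compare against a different baseline group, relevel the factor and refit:
# papers$author_group <- relevel(papers$author_group, ref = "male")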

Bootstrap: Factors affecting erosion

I have a problem deriving a meaningful generalized linear regression model. The response variable (erosion rate) depends on several predictors, some of which are factors and others numeric vectors: elevation, slope, depth to permafrost (numeric), exposure (rank, numeric), and permeability (factor). The data fall into seven groups corresponding to geographic location; within each group the predictors are constant, so only the response (erosion rate) varies. There are 1700 data points.
I would like to set up a generalized linear model to test the influence of the predictors on the erosion rate. Because the data are spatially autocorrelated, the analysis should draw 30 samples from each group, run the glm, and repeat this, say, 10,000 times; I would then compare the spread of the coefficients. Any help is highly appreciated!
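One possible sketch of that resampling scheme in base R (the data frame and column names are hypothetical, and the Gamma family is an assumption for a positive, continuous erosion rate):

set.seed(1)
n_boot <- 10000
coefs <- replicate(n_boot, {
  # draw 30 row indices from each of the seven geographic groups
  idx <- unlist(lapply(split(seq_len(nrow(erosion_df)), erosion_df$group),
                       function(rows) sample(rows, 30)))
  fit <- glm(erosion_rate ~ elevation + slope + permafrost_depth +
               exposure + permeability,
             data = erosion_df[idx, ], family = Gamma(link = "log"))
  coef(fit)
})
# spread of each coefficient across the resamples
apply(coefs, 1, quantile, probs = c(0.025, 0.5, 0.975))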

Logistic Regression Model & Multicollinearity of Categorical Variables in R

I have a training dataset with 3233 rows and 62 columns. The dependent variable is Happy (train$Happy), which is binary. The other 61 columns are categorical independent variables.
I've created a logistic regression model as follows:
logModel <- glm(Happy ~ ., data = train, family = binomial)
However, I want to reduce the number of independent variables going into the model, perhaps down to 20 or so, and I would like to start by removing collinear categorical variables.
Can someone shed some light on how to determine which categorical variables are collinear, and what threshold I should use when removing a variable from the model?
Thank you!
If your variables were numeric, the obvious solution would be penalized logistic regression (the lasso); in R it is implemented in glmnet. With categorical variables the problem is much more difficult.
I was in a similar situation and used the importance plot from the randomForest package to reduce the number of variables.
This will not find collinearity for you; it only ranks the variables by importance.
You have only 61 variables, and presumably some knowledge of the field, so you could also add variables that make sense to you to the model (like z = x1 - x3, if you think the difference x1 - x3 is important) and then rank them with a random forest model, as sketched below.
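A minimal sketch of the random-forest ranking (assuming the train data from the question, with Happy converted to a factor for classification):

library(randomForest)
train$Happy <- as.factor(train$Happy)
rf <- randomForest(Happy ~ ., data = train, importance = TRUE)
varImpPlot(rf)   # rank the 61 categorical predictors by importance
importance(rf)   # the same information as a matrix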
You could use Cramer's V, or the related phi or contingency coefficient (see a great paper at http://www.harding.edu/sbreezeel/460%20files/statbook/chapter15.pdf), to measure collinearity among categorical variables. If two or more categorical variables have a Cramer's V value close to 1, they are highly "correlated" and you may not need to keep all of them in your logistic regression model.
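A minimal sketch of computing Cramer's V for a pair of categorical variables (the helper function and column names here are hypothetical, not from any package):

cramers_v <- function(x, y) {
  tbl  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tbl, correct = FALSE)$statistic)
  # V = sqrt(chi-squared / (n * (min(rows, cols) - 1)))
  sqrt(as.numeric(chi2) / (sum(tbl) * (min(dim(tbl)) - 1)))
}
cramers_v(train$var1, train$var2)   # values near 1 suggest redundant predictors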
