What is the interpretation of different formulas in multinom()? (R)

I want to run a multinomial logit regression using the multinom() function from the nnet R package. I need some help with the interpretation of different formulas. My dataset has 3 IVs (Age (3 levels), Personality (4 levels), and Test (2 levels)) and 1 DV (Response (4 levels)); all variables are categorical. Here is an example of the dataset:
> head(df)
  Personality   Age Test Response
1           1   50+  pre        A
2           1   50+ post        A
3           2 30-39  pre        B
4           2   50+ post        C
5           3 20-29  pre        D
6           3 20-29 post        A
First, I want to understand the effect of personality on users' responses in the different tests. I wrote two formulas (below), but I do not know what the difference between them is or how their results are interpreted differently.
Eq1 <- multinom(formula = Response ~ Test + Personality, data = df)
Eq2 <- multinom(formula = Response ~ Test * Personality, data = df)
Second, how can the results be reported in a paper?
Thanks for your help.
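A minimal sketch of the distinction, assuming the column names shown in head(df) above: the additive formula (+) assumes that the effect of personality on response is the same in the pre- and post-test conditions, while the interaction formula (*) lets that effect differ by test; the two fits can then be compared with a likelihood-ratio test.
library(nnet)

# Additive model: personality effects assumed identical in pre and post tests
eq1 <- multinom(Response ~ Test + Personality, data = df)

# Interaction model: personality effects may differ between pre and post tests
eq2 <- multinom(Response ~ Test * Personality, data = df)

# Likelihood-ratio test of whether the interaction improves the fit
anova(eq1, eq2)

# Coefficients are log-odds relative to the reference response category;
# exponentiate to get relative-risk ratios for reporting
exp(coef(eq2))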

Related

How do you organize data for and run multinomial probit in R?

I apologize for the "how do I run this model in R" question. I will be the first to admit that I am a newbie when it comes to statistical models. Hopefully I have enough substantive questions surrounding it to be interesting, and the question will come out more like, "Does this command in R correspond to this statistical model?"
I am trying to estimate a model of the probability that a given Twitter user "follows" a political user from a given political party. My data frame is at the level of individual users, where each user can choose to follow or not follow a party on Twitter. As alternative-specific variables I have measures of ideological distance between the Twitter user and the political party, and an interaction term that specifies whether the distance is positive or negative. Thus, the decision to follow a politician on Twitter is a function of ideological distance.
Initially I tried to estimate a conditional logit model, but I quickly moved away from that idea since the choices are not mutually exclusive, i.e. users can choose to follow more than one party. Now I am in doubt whether I should employ a multinomial probit or a multivariate probit, since I want my model to allow individuals to choose more than one alternative. However, when I try to estimate a multinomial probit, my code doesn't work. My code is:
mprobit <- mlogit(Follow ~ F1_Distance + F2_Distance +
                    F1_Distance*F1_interaction + F2_Distance*F2_interaction +
                    strata(id),
                  long, probit = T, seed = 123)
And I get the following error message:
Error in dfidx::dfidx(data = data, dfa$idx, drop.index = dfa$drop.index, :
the two indexes don't define unique observations
I've tried looking the error up, but I can't seem to find anything that relates to probit models. Can you tell me what I'm doing wrong? Once again, sorry for my ignorance. Thank you for your help.
Also, I've copied my data frame below. These are the first 6 observations, for the first Twitter user, but I have a dataset of 5181 users, which corresponds to 51810 observations, since there are 10 parties in Denmark.
  id     Alternative Follow F1_Distance F2_Distance F1_interaction F2_interaction
1  1    alternativet      1  -0.9672566  -1.3101138              0              0
2  1 danskfolkeparti      0   0.6038972   1.3799961              1              1
3  1    konservative      1   1.0759252   0.8665096              1              1
4  1    enhedslisten      0  -1.0831657  -1.0815424              0              0
5  1 liberalalliance      0   1.5389934   0.8470291              1              1
6  1   nyeborgerlige      1   1.4139934   0.9898862              1              1
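As a rough sketch rather than a verified fix: that error usually means the id/alternative pair does not uniquely identify the rows, and strata(id) is survival-package syntax that mlogit does not use. One common setup, assuming the column names printed above and a long-format data frame called long, is to index the data explicitly before fitting:
library(mlogit)

# Index the long-format data: 'id' identifies the choice situation (the user)
# and 'Alternative' the party; these two columns must uniquely identify rows
long_idx <- mlogit.data(long, choice = "Follow", shape = "long",
                        chid.var = "id", alt.var = "Alternative")

# Multinomial probit with alternative-specific covariates; the grouping is
# carried by the index, so no strata() term is needed
mprobit <- mlogit(Follow ~ F1_Distance + F2_Distance +
                    F1_Distance:F1_interaction + F2_Distance:F2_interaction,
                  data = long_idx, probit = TRUE, seed = 123)
summary(mprobit)
Note that mlogit expects exactly one chosen alternative per choice situation; since a user here can follow several parties (as the question itself points out), a set of binary probit models per party, or a multivariate probit, may fit the problem better.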

Predicting survival within time (cumulative hazard) [duplicate]

This question already has an answer here: Extract survival probabilities in Survfit by groups (1 answer). Closed 3 years ago.
By using R, how can one develop an index score for predicting patient overall survival (OS)?
I have a shortlist of 4 candidate predictors shown to be associated with OS. They came from a multivariable Cox regression (run with coxph()). The predictors are protein levels, hence they are all continuous variables.
The data table looks something like this (showing only n=10 here):
days Status Prot1 Prot13 Prot7 Prot21
Subj_1 115.69 0 2.284498 6.319168 6.070115 8.457412
Subj_2 72.30 1 2.473034 6.066573 6.140178 8.225987
Subj_3 1.08 1 2.662481 6.212845 6.971018 8.128949
Subj_4 69.63 1 2.761391 5.902610 6.433883 7.876319
Subj_5 78.41 1 3.038122 6.355257 6.852981 7.500973
Subj_6 42.90 1 2.058549 6.020681 7.231307 8.164025
Subj_7 31.00 1 2.305096 5.415107 8.126941 8.566320
Subj_8 51.12 1 2.931978 5.574601 7.503275 7.529957
Subj_9 11.01 1 2.218814 6.270222 6.710297 8.193895
Subj_10 27.68 1 2.821947 6.132379 6.911071 8.428218
The question is: how can I create a formula that classifies these patients into two groups, one with estimated survival < 60% at 1 year and another with estimated survival > 60% over the same period?
Is there a function in R that deals with this?
Thanks a lot in advance.
I think you should post this question on https://stats.stackexchange.com since it is a matter of statistics. Anyway, you could start with a binomial regression, but there are many other models you could try. How many subjects do you have?
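A minimal sketch of one way to get there with the survival package, assuming the data frame is called d (a hypothetical name), days is measured in days, and the four proteins are the final predictors:
library(survival)

# Cox model on the four candidate predictors
fit <- coxph(Surv(days, Status) ~ Prot1 + Prot13 + Prot7 + Prot21, data = d)

# Model-based survival curve for each patient
sf <- survfit(fit, newdata = d)

# Estimated survival probability at 1 year (365 days) for each patient;
# extend = TRUE keeps the estimate if 365 exceeds the last observed time
surv_1yr <- as.numeric(summary(sf, times = 365, extend = TRUE)$surv)

# Classify patients by whether estimated 1-year survival exceeds 60%
d$group <- ifelse(surv_1yr > 0.60, "1-year survival > 60%", "1-year survival < 60%")
table(d$group)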

How to use the predict() function in the R package "pscl" with categorical predictor variables

I'm fitting count data (number of fledgling birds produced per territory) using zero-inflated Poisson models in R. While model fitting works fine, I'm having trouble using the predict function to get estimates for multiple values of one category (Year) averaged over the values of another category (StudyArea). Both variables are dummy coded (0/1) and are set up as factors. The data frame sent to the predict function looks like this:
Year_d StudyArea_d
1 0 0.5
2 1 0.5
However, I get the error message:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
If instead I use a data frame such as:
Year_d StudyArea_d
1 0 0
2 0 1
3 1 0
4 1 1
I get sensible estimates of fledgling counts per year and study site combination. However, I'm not really interested in the effect of study site (the effect is small and isn't involved in an interaction), and the year effect is really what the study was designed to examine.
I have previously used similar code to successfully get estimated counts from a model that had one categorical and one continuous predictor variable (averaging over the levels of the dummy-coded factor), using a data frame similar to:
VegHeight StudyArea_d
1 0 0.5
2 0.5 0.5
3 1 0.5
4 1.5 0.5
So I'm a little confused why the first attempt I describe above doesn't work.
I can work on constructing a reproducible example if it would help, but I have a hunch that I'm not understanding something basic about how the predict function works when dealing with factors. If anyone can help me understand what I need to do to get estimates at both levels of one factor, averaged over the levels of another factor, I would really appreciate it.
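A rough sketch of the usual workaround, assuming a zeroinfl() fit called m on a data frame dat whose dummy factors have levels 0 and 1 (all hypothetical names): because StudyArea_d is a factor, 0.5 is not a valid level, so predict at every Year x StudyArea combination and then average the predictions over StudyArea_d.
library(pscl)

# Prediction grid over the actual factor levels (no 0.5 "half level");
# assumes the factors in the fitted model are coded with levels 0 and 1
newdat <- expand.grid(Year_d = factor(0:1), StudyArea_d = factor(0:1))

# Expected fledgling counts for each Year x StudyArea combination
newdat$pred <- predict(m, newdata = newdat, type = "response")

# Average over study areas within each year
aggregate(pred ~ Year_d, data = newdat, FUN = mean)
Setting a dummy to 0.5 only works when it enters the model as a numeric 0/1 variable (as in the VegHeight example); once it is a factor, predict() needs actual levels, so the averaging has to be done on the predictions themselves.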

Three factor logistic regression with interactions

I have a three-factor contingency table that explores the association between committed crimes (shoplifting or other theft acts), gender, and prior convictions on the one hand, and lenient sentences on the other. Lenient sentence is the response variable and is binary: 1 for receiving a lenient sentence, 0 otherwise.
Crime Gender Priorconv Yes No
1 Shoplifting Men N 24 1
2 Other Theft Acts Men N 52 9
3 Shoplifting Women N 48 3
4 Other Theft Acts Women N 22 2
5 Shoplifting Men P 17 6
6 Other Theft Acts Men P 60 34
7 Shoplifting Women P 15 6
8 Other Theft Acts Women P 4 3
You can recreate the table using these commands
table1<-expand.grid(Crime=factor(c("Shoplifting","Other Theft Acts")),Gender=factor(c("Men","Women")),
Priorconv=factor(c("N","P")))
table1<-data.frame(table1,Yes=c(24,52,48,22,17,60,15,4),No=c(1,9,3,2,6,34,6,3))
I have been trying to run a logistic regression but quickly ran into trouble when I tried to include interactions between my variables. The glm works perfectly without the interactions. The code I have been using is
fit<-glm(cbind(Yes,No)~Crime+Gender+Priorconv+I(Crime*Priorconv),data=table1,family=binomial)
and the error I have been getting
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
In Ops.factor(Crime, Priorconv) : * not meaningful for factors
Could you please tell how I could deal with this error?
Thank you
By specifying I(Crime*Priorconv) you are asking R to compute the value Crime*Priorconv, which it refuses to do (because it doesn't make sense to multiply factors). If Crime and Priorconv were already numeric dummy variables (e.g. 0/1 coding with 0 = shoplifting, 1 = other theft and 0 = N, 1 = P), then it would make sense to multiply them, and you would use the I() notation to indicate that you wanted the literal product.
Otherwise (if you don't use I()), R interprets * as "interaction plus all lower-order effects", i.e. Crime*Priorconv corresponds to 1+Crime+Priorconv+Crime:Priorconv (where : denotes the interaction). R automatically handles the redundancies (i.e. the fact that you have already specified main effects of Crime and Priorconv), and in a formula context it makes no difference whether you include redundant main effects or write the intercept (1) explicitly. These formulas all specify the same model:
1+Crime+Priorconv+Crime:Priorconv
Crime+Priorconv+Crime*Priorconv
Crime+Priorconv+Crime:Priorconv
Crime*Priorconv
but I prefer the last one: as @J.R. points out in his answer, you can take advantage of the * notation to express your model more compactly.
You can use x:y in the formula to specify an interaction between x and y, e.g.:
fit<-glm(cbind(Yes,No)~Crime+Gender+Priorconv+Crime:Priorconv,data=table1,family=binomial)
or a little shorter:
fit<-glm(cbind(Yes,No)~Gender+Crime*Priorconv,data=table1,family=binomial)
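As a small follow-up (not part of the original answers), whether the Crime:Priorconv interaction is supported can be checked on the fit from the last call above, for example with an analysis-of-deviance table:
# Sequential analysis of deviance; the last row tests the interaction term
anova(fit, test = "Chisq")

# Or drop each highest-order term and test the change in deviance
drop1(fit, test = "Chisq")

summary(fit)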

R: how to estimate a fixed effects model with weights

I would like to run a fixed-effects model using OLS with weighted data.
Since there can be some confusion, I mean "fixed effects" here in the sense that economists usually imply, i.e. a "within" model, or in other words individual-specific effects. What I actually have is "multilevel" data, i.e. observations of individuals, and I would like to control for their region of origin (and have corresponding clustered standard errors).
Sample data:
library(multilevel)
data(bhr2000)
weight <- runif(length(bhr2000$GRP),min=1,max=10)
bhr2000 <- data.frame(bhr2000,weight)
head(bhr2000)
GRP AF06 AF07 AP12 AP17 AP33 AP34 AS14 AS15 AS16 AS17 AS28 HRS RELIG weight
1 1 2 2 2 4 3 3 3 3 5 5 3 12 2 6.647987
2 1 3 3 3 1 4 3 3 4 3 3 3 11 1 6.851675
3 1 4 4 4 4 3 4 4 4 2 3 4 12 3 8.202567
4 1 3 4 4 4 3 3 3 3 3 3 4 9 3 1.872407
5 1 3 4 4 4 4 4 3 4 2 4 4 9 3 4.526455
6 1 3 3 3 3 4 4 3 3 3 3 4 8 1 8.236978
The kind of model I would like to estimate is:
AF06_ij = beta_0 + beta_1 AP34_ij + alpha_1 * (GRP == 1) + alpha_2 * (GRP==2) +... + e_ij
where i refers to specific individuals and j refers to the group they belong to.
Moreover, I would like observations to be weighted by weight (sampling weights).
However, I would like to get "clustered standard errors", to reflect possible GRP-specific heteroskedasticity. In other words, E(e_ij)=0 but Var(e_ij)=sigma_j^2 where the sigma_j can be different for each GRP j.
If I understood correctly, nlme and lme4 can only estimate random-effects (so-called mixed) models, not fixed-effects models in the "within" sense.
I tried the plm package, which looked ideal for what I want to do, but it does not allow for weights. Any other ideas?
I think this is more of a Stack Exchange question, but setting aside fixed effects with model weights: you shouldn't be using OLS for an ordered categorical response variable. This is an ordered logistic modeling type of analysis, so below I use the data you provided to fit one of those.
Just to be clear, we have an ordered categorical response, AF06, and two predictors. The first, AP34, is also an ordered categorical variable; the second, GRP, is your fixed effect. Generally you can create a group fixed effect by coercing the variable in question to a factor on the right-hand side of the formula. (I'm really trying to stay away from statistical theory because this isn't the place for it, so I might be imprecise in some of what I'm saying.)
The code below fits an ordered logistic model using the polr (proportional odds logistic regression) function. I've tried to interpret what you were going for in terms of model specification, but at the end of the day OLS is not the right way forward. The call to coefplot will have a very crowded y-axis; I just wanted to present a rudimentary start at how you might inspect the fit, and I'd try to visualize it in a more refined way for sure. As for interpretation, you will need to work on that, but I think this is generally the right method. The best resource I can think of is chapters 5 and 6 of "Data Analysis Using Regression and Multilevel/Hierarchical Models" by Gelman and Hill. It's such a good resource that I'd really recommend reading the whole thing and trying to master it if you're interested in this type of analysis going forward.
library(multilevel) # To get the data
library(MASS) # To get the polr modeling function
library(arm) # To get the tools, insight and expertise of Andrew Gelman and his team
# The data
data(bhr2000)
weight <- runif(length(bhr2000$GRP), min = 1, max = 10)
bhr2000 <- data.frame(bhr2000, weight)
head(bhr2000)
# The model
m <- polr(factor(AF06) ~ AP34 + factor(GRP),weights = weight, data = bhr2000, Hess=TRUE, method = "logistic")
summary(m)
coefplot(m,cex.var=.6) # from the arm package
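As a small usage note on the fit above (not from the original answer), predicted category probabilities can be pulled out of the polr object with predict():
# Predicted probability of each AF06 category for every observation
head(predict(m, type = "probs"))

# Most likely category per observation
head(predict(m, type = "class"))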
Check out the lfe package: it does econ-style fixed effects and you can specify clustering.
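If the linear (OLS) within estimator described in the question is what's wanted, a minimal sketch with lfe::felm might look like the following, assuming felm's four-part formula (outcome ~ covariates | fixed effects | instruments | cluster) and its weights argument:
library(multilevel)
library(lfe)

data(bhr2000)
bhr2000$weight <- runif(nrow(bhr2000), min = 1, max = 10)

# Within estimator: GRP absorbed as a fixed effect, no instruments,
# standard errors clustered by GRP, observations weighted by 'weight'
fe_fit <- felm(AF06 ~ AP34 | GRP | 0 | GRP,
               data = bhr2000, weights = bhr2000$weight)
summary(fe_fit)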
