R: How to estimate a fixed effects model with weights

I would like to estimate a fixed-effects model by OLS with weighted data.
To avoid confusion: I use "fixed effects" in the sense economists usually intend, i.e. a "within" model with individual-specific effects. What I actually have is "multilevel" data, i.e. observations of individuals, and I would like to control for their region of origin (with correspondingly clustered standard errors).
Sample data:
library(multilevel)
data(bhr2000)
weight <- runif(length(bhr2000$GRP),min=1,max=10)
bhr2000 <- data.frame(bhr2000,weight)
head(bhr2000)
GRP AF06 AF07 AP12 AP17 AP33 AP34 AS14 AS15 AS16 AS17 AS28 HRS RELIG weight
1 1 2 2 2 4 3 3 3 3 5 5 3 12 2 6.647987
2 1 3 3 3 1 4 3 3 4 3 3 3 11 1 6.851675
3 1 4 4 4 4 3 4 4 4 2 3 4 12 3 8.202567
4 1 3 4 4 4 3 3 3 3 3 3 4 9 3 1.872407
5 1 3 4 4 4 4 4 3 4 2 4 4 9 3 4.526455
6 1 3 3 3 3 4 4 3 3 3 3 4 8 1 8.236978
The kind of model I would like to estimate is:
AF06_ij = beta_0 + beta_1 AP34_ij + alpha_1 * (GRP == 1) + alpha_2 * (GRP==2) +... + e_ij
where i indexes individuals and j indexes the group each individual belongs to.
Moreover, I would like observations to be weighted by weight (sampling weights).
I would also like "clustered standard errors" to reflect possible GRP-specific heteroskedasticity. In other words, E(e_ij)=0 but Var(e_ij)=sigma_j^2, where sigma_j can differ across groups j.
If I understood correctly, nlme and lme4 can only estimate random-effects models (so-called mixed models), not fixed-effects models in the "within" sense.
I tried the plm package, which looked ideal for what I want to do, but it does not allow for weights. Any other ideas?

I think this is more of a Stack Exchange question. But setting aside fixed effects and model weights for a moment: you shouldn't be using OLS for an ordered categorical response variable. This calls for an ordered logistic model, so below I use the data you provided to fit one.
Just to be clear, we have an ordered categorical response, "AF06", and two predictors. The first, "AP34", is also an ordered categorical variable; the second, "GRP", is your fixed effect. In general you can create a group fixed effect by coercing the variable in question to a factor on the right-hand side of the formula. (I'm trying to stay away from statistical theory because this isn't the place for it, so I may be inaccurate in some of what I'm saying.)
The code below fits an ordered logistic model using the polr (proportional odds logistic regression) function. I've tried to interpret what you were going for in terms of model specification, but at the end of the day OLS is not the right way forward. The call to coefplot will produce a very crowded y axis; I just wanted to present a rudimentary start at how you might interpret the results, and I'd certainly try to visualize this in a more refined way. As for interpretation, you will need to work on that, but I think this is generally the right method. The best resource I can think of is chapters 5 and 6 of "Data Analysis Using Regression and Multilevel/Hierarchical Models" by Gelman and Hill. It's such a good resource that I'd recommend reading the whole thing and trying to master it if you're interested in this type of analysis going forward.
library(multilevel) # To get the data
library(MASS) # To get the polr modeling function
library(arm) # To get the tools, insight and expertise of Andrew Gelman and his team
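# The data
data(bhr2000) # load the example data from the multilevel package
weight <- runif(length(bhr2000$GRP), min = 1, max = 10) # random weights, as in the question
bhr2000 <- data.frame(bhr2000, weight)
head(bhr2000)
# The model: proportional-odds logit with GRP entered as a factor (group fixed effects)
m <- polr(factor(AF06) ~ AP34 + factor(GRP), weights = weight, data = bhr2000, Hess = TRUE, method = "logistic")
summary(m)
coefplot(m, cex.var = .6) # from the arm package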

Check out the lfe package---it does econ style fixed effects and you can specify clustering.
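Independently of any particular package, the weighted "within" estimate the question asks for can be sketched in base R. The example below uses simulated data (the names d, grp, x, y, w and the helper wmean are hypothetical, not from the question) and checks that weighted least squares with group dummies gives the same slope as regressing weighted-group-demeaned y on weighted-group-demeaned x. The cluster-robust standard error at the end is a plain CR0-style sandwich with no small-sample correction, which is an assumption of this sketch; packages typically apply finite-sample adjustments, so their numbers may differ somewhat.

```r
set.seed(1)

# Simulated stand-in data (hypothetical): 20 groups of 15 individuals
n_groups <- 20
n_per <- 15
d <- data.frame(grp = rep(seq_len(n_groups), each = n_per))
d$x <- rnorm(nrow(d)) + d$grp / 10
d$y <- 1 + 0.5 * d$x + rnorm(n_groups)[d$grp] + rnorm(nrow(d))
d$w <- runif(nrow(d), min = 1, max = 10)  # sampling weights

# 1) Weighted least squares with one dummy per group (LSDV)
lsdv <- lm(y ~ x + factor(grp), data = d, weights = w)

# 2) Weighted within transformation: subtract the weighted group mean
wmean <- function(v, w, g) ave(v * w, g, FUN = sum) / ave(w, g, FUN = sum)
d$y_w <- d$y - wmean(d$y, d$w, d$grp)
d$x_w <- d$x - wmean(d$x, d$w, d$grp)
within <- lm(y_w ~ x_w - 1, data = d, weights = w)

beta_lsdv <- unname(coef(lsdv)["x"])
beta_within <- unname(coef(within)["x_w"])

# 3) CR0-style cluster-robust standard error for the slope, clustering on grp
#    (assumption: no small-sample correction is applied)
e <- residuals(within)                        # unweighted residuals
score <- tapply(d$w * d$x_w * e, d$grp, sum)  # per-cluster score sums
bread <- 1 / sum(d$w * d$x_w^2)
se_cluster <- sqrt(bread^2 * sum(score^2))

print(c(lsdv = beta_lsdv, within = beta_within, se_cluster = se_cluster))
```

With the question's data, the same idea would replace y, x, grp and w by AF06, AP34, GRP and weight.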

Related

How do you organize data for and run multinomial probit in R?

I apologize for the "how do I run this model in R" question. I will be the first to admit that I am a newbie when it comes to statistical models. Hopefully I have enough substantive questions surrounding it to be interesting, and the question will come out more like, "Does this command in R correspond to this statistical model?"
I am trying to estimate the probability that a given Twitter user "follows" a political user from a given political party. My data frame is at the level of individual users, where each user can choose to follow or not follow a party on Twitter. As alternative-specific variables I have measures of ideological distance between the Twitter user and the political party, and an interaction term that specifies whether the distance is positive or negative. Thus, the decision to follow a politician on Twitter is a function of your ideological distance.
Initially I tried to estimate a conditional logit model, but I quickly abandoned that idea because the choices are not mutually exclusive, i.e. users can choose to follow more than one party. Now I am unsure whether I should use a multinomial probit or a multivariate probit, since I want my model to allow individuals to choose more than one alternative. However, when I try to estimate a multinomial probit, my code doesn't work. My code is:
mprobit <- mlogit(Follow ~ F1_Distance+F2_Distance+F1_Distance*F1_interaction+F2_Distance*F2_interaction+strata(id),
long, probit = T, seed = 123)
And i get the following error message:
Error in dfidx::dfidx(data = data, dfa$idx, drop.index = dfa$drop.index, :
the two indexes don't define unique observations
I've tried looking the error up, but I can't seem to find anything that relates to probit models. Can you tell me what I'm doing wrong? Once again, sorry for my ignorance. Thank you for your help.
I've also copied part of my data frame below. It shows the first 6 observations for the first Twitter user, but I have a dataset of 5181 users, which corresponds to 51810 observations, since there are 10 parties in Denmark.
  id Alternative     Follow F1_Distance F2_Distance F1_interaction F2_interaction
1  1 alternativet         1  -0.9672566  -1.3101138              0              0
2  1 danskfolkeparti      0   0.6038972   1.3799961              1              1
3  1 konservative         1   1.0759252   0.8665096              1              1
4  1 enhedslisten         0  -1.0831657  -1.0815424              0              0
5  1 liberalalliance      0   1.5389934   0.8470291              1              1
6  1 nyeborgerlige        1   1.4139934   0.9898862              1              1
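The error message says that the two index columns do not identify each row uniquely. Whether that is what happens in the asker's data cannot be told from the excerpt, but a minimal base-R sketch of how one might check it is shown here. It uses hypothetical toy data (not the data above) and assumes that id and Alternative are the intended pair of indexes.

```r
# Toy long-format data (hypothetical): user 2 has "enhedslisten" twice
long <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 2),
  Alternative = c("alternativet", "konservative", "enhedslisten",
                  "alternativet", "konservative", "enhedslisten",
                  "enhedslisten"),
  Follow = c(1, 1, 0, 0, 1, 0, 0)
)

# Each (id, Alternative) pair should appear exactly once
key <- long[, c("id", "Alternative")]
dup_rows <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Rows involved in a duplicated pair
print(long[dup_rows, ])

# Number of rows per user; with 10 parties one would expect 10 for every id
rows_per_id <- table(long$id)
print(rows_per_id)
```

If no pair is duplicated, the cause more likely lies in how the index columns are passed to the model function.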

Bayesian Question: Exponential Prior and Poisson Likelihood: Posterior?

I need help with a particular question and confirmation of my understanding.
The belief is that absences in a company follow a Poisson(λ) distribution.
It is additionally believed that there is a 75% probability that λ is less than 5, so an exponential distribution is chosen as the prior for λ. You take a random sample of 50 students and find out the number of absences that each has had over the past semester.
The data are summarised below; note that 0 and 1 are binned together.
Number of absences: ≤1  2  3  4  5  6  7  8  9  10
Frequency:          18  13 8  3  4  3  0  0  0  1
To calculate the posterior distribution, my understanding is that it is proportional to prior × likelihood, which in this case is an Exponential(1/2.56) prior and a Poisson likelihood, with the belief that the probability of λ being less than 5 is 0.75 incorporated by solving
-ln(1-0.75)/(1/2.56) = 3.5489.
Furthermore, a similar thread calculated the posterior to be a Gamma(sum(x_i)+1, n+lambda) distribution.
With those assumptions, I have some code to visualise this:
x=seq(from=0, to=10, by= 1)
plot(x,dexp(x,rate = 0.390625),type="l",col="red")
lines(x,dpois(x,3.54890),col="blue")
lines(x,dgamma(x,128+1,50+3.54890),col="green")
Any help or clarification surrounding this would be greatly appreciated.
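For reference, the conjugate update can be sketched as follows. With an Exponential(r) prior, the prior density is proportional to exp(-rλ) and the Poisson likelihood of n observations is proportional to λ^(Σx) exp(-nλ), so the posterior is proportional to λ^(Σx) exp(-(n + r)λ), i.e. Gamma(shape = Σx + 1, rate = n + r). The sketch below rests on two assumptions that are not confirmed by the question: the prior statement is read as P(λ < 5) = 0.75, which gives r = -ln(0.25)/5 (a different value from the figures in the question), and every observation in the "≤ 1" bin is treated as exactly 1, which is what the 128 in the question's code implies. The variable names are illustrative.

```r
# Prior: lambda ~ Exponential(rate = prior_rate)
# ASSUMPTION: the prior belief is read as P(lambda < 5) = 0.75
prior_rate <- -log(1 - 0.75) / 5   # about 0.277

# Data from the frequency table
# ASSUMPTION: all 18 observations in the "<= 1" bin are counted as 1
absences  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
frequency <- c(18, 13, 8, 3, 4, 3, 0, 0, 0, 1)
n     <- sum(frequency)             # 50 observations
sum_x <- sum(absences * frequency)  # 128 absences in total

# Conjugate update: posterior is Gamma(shape = sum_x + 1, rate = n + prior_rate)
post_shape <- sum_x + 1
post_rate  <- n + prior_rate

post_mean <- post_shape / post_rate
post_ci   <- qgamma(c(0.025, 0.975), shape = post_shape, rate = post_rate)

print(c(prior_rate = prior_rate, post_mean = post_mean))
print(post_ci)
```

When plotting, evaluating dexp and dgamma on a fine grid of λ values (for example seq(0.01, 10, length.out = 500)) shows the shapes better than a grid of integers. Note also that the Poisson pmf is a distribution over counts, not over λ, so it does not belong on the same axis as the prior and posterior densities.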

how to find prediction error after finding prediction model [duplicate]

Given two simple sets of data:
head(training_set)
x y
1 1 2.167512
2 2 4.684017
3 3 3.702477
4 4 9.417312
5 5 9.424831
6 6 13.090983
head(test_set)
x y
1 1 2.068663
2 2 4.162103
3 3 5.080583
4 4 8.366680
5 5 8.344651
I want to fit a linear regression line on the training data, and use that line (or the coefficients) to calculate the "test MSE" or Mean Squared Error of the Residuals on the test data once that line is fit there.
model = lm(y~x,data=training_set)
train_MSE = mean(model$residuals^2)
test_MSE = ?
In this case, it is more precise to call it MSPE (mean squared prediction error):
mean((test_set$y - predict(model, newdata = test_set))^2)
This is the more useful measure when the goal of the model is prediction: we want a model with minimal MSPE.
In practice, if we do have a spare test data set, we can compute MSPE directly as above. Very often, however, we don't have spare data. Leave-one-out cross-validation then provides an estimate of MSPE from the training dataset alone.
There are also several other statistics for assessing prediction error, such as Mallows's Cp and AIC.
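As an illustration of the leave-one-out idea, here is a minimal sketch on simulated data (the data and variable names are hypothetical, not the training_set from the question). For a linear model fitted by least squares, the leave-one-out prediction errors can be obtained from a single fit by dividing each residual by one minus its hat value; the explicit refitting loop is included only to check that shortcut.

```r
set.seed(42)

# Simulated training data (hypothetical stand-in)
training_set <- data.frame(x = 1:20)
training_set$y <- 2 * training_set$x + rnorm(20)

model <- lm(y ~ x, data = training_set)

# Shortcut: leave-one-out residual = ordinary residual / (1 - leverage)
h <- hatvalues(model)
loocv_shortcut <- mean((residuals(model) / (1 - h))^2)

# Explicit version: refit without observation i, then predict observation i
loocv_loop <- mean(sapply(seq_len(nrow(training_set)), function(i) {
  fit_i <- lm(y ~ x, data = training_set[-i, ])
  (training_set$y[i] - predict(fit_i, newdata = training_set[i, ]))^2
}))

train_MSE <- mean(residuals(model)^2)
print(c(train_MSE = train_MSE, loocv = loocv_shortcut))
```

The leave-one-out estimate is never smaller than the training MSE, since each residual is divided by a number between 0 and 1 before squaring.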

What is the interpretation of different formula in Multinom?

I want to run a multinomial logit regression using the multinom() function from the nnet R package. I need some help with the interpretation of different formulas. My dataset has 3 IVs (Age (3 levels), Personality (4 levels) and Test (2 levels)) and 1 DV (Response (4 levels)), and all variables are categorical. Here is an example of the dataset.
>head(df)
Personality Age Test Response
1 1 50+ pre A
2 1 50+ post A
3 2 30-39 pre B
4 2 50+ post C
5 3 20-29 pre D
6 3 20-29 post A
First, I want to understand the effect of personality on users' responses in the different tests. I wrote two formulas (below), but I do not know what the difference between them is or how their results should be interpreted differently.
library(nnet)
Eq1 = multinom(Response ~ Test + Personality, data = df) # column names are case-sensitive
Eq2 = multinom(Response ~ Test * Personality, data = df)
Second, I want to understand how the results can be reported in a paper.
Thanks for your help.
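The syntactic difference between the two formulas can be inspected in base R without fitting anything. In R's formula language, Test * Personality is shorthand for Test + Personality + Test:Personality, so the second model adds interaction terms that let the effect of Personality differ between the pre and post tests. The sketch below uses a hypothetical toy design (not the asker's data) with the level counts stated in the question.

```r
f_add <- Response ~ Test + Personality
f_int <- Response ~ Test * Personality

# Terms each formula expands to
labels_add <- attr(terms(f_add), "term.labels")
labels_int <- attr(terms(f_int), "term.labels")
print(labels_add)  # "Test" "Personality"
print(labels_int)  # "Test" "Personality" "Test:Personality"

# Hypothetical toy design: Test with 2 levels, Personality with 4 levels
toy <- expand.grid(
  Test = factor(c("pre", "post")),
  Personality = factor(1:4)
)

# Columns of the design matrix, including the intercept
n_add <- ncol(model.matrix(~ Test + Personality, data = toy))
n_int <- ncol(model.matrix(~ Test * Personality, data = toy))
print(c(additive = n_add, interaction = n_int))
```

multinom() estimates one such set of coefficients for each non-reference level of the response, so with a 4-level response each model has three sets.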

Resources