I've been trying to estimate a panel VAR (pVAR) by GMM in R with the panelvar package. I'm estimating a dynamic panel VAR with two-step GMM using first differences.
I have a balanced panel of 378 observations with a group variable (id) and a time variable (year): 27 groups with 14 observations each. In total I have 120 instruments. I'm a bit concerned about the result of the Hansen J-test and I'm looking for an explanation: I get a Hansen J-test statistic of 0 with a p-value of 1. To my understanding this would mean that the overidentifying restrictions are not rejected and the model is correctly specified, but a p-value this high (1.000) makes me suspect that something deeper is going on.
In my estimation, I have 7 dependent variables, 2 exogenous variables, and I'm using 4 lagged instruments per dependent variable. Why is the p-value of the Hansen test very high?
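A sketch of the kind of call I am running (dep1 through dep7, exog1, exog2 and mydata are placeholder names; the lag order is illustrative):
library(panelvar)

pv <- pvargmm(
  dependent_vars = c("dep1", "dep2", "dep3", "dep4", "dep5", "dep6", "dep7"),
  lags = 1,                            # VAR lag order (illustrative)
  exog_vars = c("exog1", "exog2"),
  transformation = "fd",               # first differences
  data = mydata,
  panel_identifier = c("id", "year"),
  steps = "twostep",
  max_instr_dependent_vars = 4,        # 4 lagged instruments per dependent variable
  collapse = FALSE
)
summary(pv)                            # the summary reports the Hansen J-test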
Thanks in advance!
I have the following variables, and if they were in wide format I would fit something like
lm(happiness ~ personality_trait*condition)
But my data is in long format.
I suppose it should be a repeated-measures model, but I'm not sure. I considered linear mixed models, but I'm not sure I understand them or whether they are what I'm looking for.
Thanks a lot!
participant  personality_trait1  condition  happiness
1            10                  animal     5
1            10                  human      7
2            2                   animal     3
2            2                   human      4
3            5                   animal     6
3            5                   human      2
I think
library(lme4)
lmer(happiness ~ personality_trait*condition + (1|participant), data= ...)
should do it. This allows a different intercept for each individual (drawn from a Gaussian distribution around the population-mean intercept). In some situations you could also fit a random-slopes model (a different slope for each individual), but in this case it wouldn't make sense, since you appear to have only two observations per individual (so estimates of variation in slope would be confounded with the residual variation: see here for an example).
Are your samples always in the order "animal, then human"? If not, you might want to add a subject-level fixed effect of order ...
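For concreteness, here is that model fitted to the toy data from your table (re-created below; your column is called personality_trait1, which I shorten to personality_trait to match the formula above). With such a tiny dataset lmer may warn about a singular fit, but your full data should be fine:
library(lme4)

# Re-create the toy data from the question
d <- data.frame(
  participant       = factor(c(1, 1, 2, 2, 3, 3)),
  personality_trait = c(10, 10, 2, 2, 5, 5),
  condition         = factor(c("animal", "human", "animal", "human", "animal", "human")),
  happiness         = c(5, 7, 3, 4, 6, 2)
)

# Random intercept per participant; fixed effects for the trait, the condition,
# and their interaction
m <- lmer(happiness ~ personality_trait * condition + (1 | participant), data = d)
summary(m)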
I'm trying to measure heritability of a trait, flowering time (FT), for a set of data collected from a half-sib design. The data includes FT for each mother plant and 2 half siblings from that mother plant for ~150 different maternal lines (ML). Paternity is unknown.
I've tried:
Regressing the mean sibling FT on the maternal FT and doubling the slope. This worked fine and produced an estimate of 0.14.
Running an ANOVA and using the between-ML variation to estimate the additive genetic variance. I got the idea from slide 25 of this powerpoint and from this thread on within- and between-variance calculation:
fit <- lm(FT ~ ML, data = her)
anova(fit)
her is the dataset, which in this case includes only the half-sib FT values (I excluded the maternal FT values for this attempt at heritability).
From the ANOVA output I used the "ML" mean square as the between-ML variation, which should equal 1/4 of the additive genetic variance, because the coefficient of relatedness between half-sibs is 0.25. This value turned out to be 0.098. Multiplying it by 4 therefore gives the additive genetic variance.
I used the "Residuals" mean square as all the variability except that accounted for by the "ML" term, i.e. the total variance minus 1/4 of the additive genetic variance. This turned out to be 1.342.
I then attempted to calculate heritability as Va/Vp = (4*0.098)/(1.342 + 0.098) ≈ 0.27.
This is quite different from my slope estimate, and I'm not sure if my reasoning is correct.
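In code, the second attempt amounts to something like this (a sketch that just restates the calculation above; her holds only the half-sib FT values, with ML as a factor):
fit <- lm(FT ~ ML, data = her)
tab <- anova(fit)

ms_ml  <- tab["ML", "Mean Sq"]         # between-ML mean square (~0.098 in my data)
ms_res <- tab["Residuals", "Mean Sq"]  # residual mean square (~1.342 in my data)

Va <- 4 * ms_ml                        # treating the between-ML mean square as 1/4 of Va
h2 <- Va / (ms_res + ms_ml)            # Va / Vp as defined above, ~0.27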
I've also tried the sommer and heritability packages in R, but haven't had success with either for a half-sib design and haven't found a half-sib example for either package.
Any suggestions?
Hi everybody!
I have a response variable that counts successful days in a month and is distributed in a peculiar shape (see above): about 50% of the values are zeros, and there is a heavy tail. Because of the overdispersion and the excess of zeros, I was advised to model it with a zero-inflated negative binomial regression model.
However, no matter how significant a model I obtain, its predictions reflect little of those distributional features (see below). For example, the peak is always around 4, and no predictions fall beyond 20.
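For reference, this is roughly the kind of model I have been fitting (a sketch only; y, x1, x2 and mydata are placeholder names):
library(pscl)

# Zero-inflated negative binomial: count part | zero-inflation part
zinb <- zeroinfl(y ~ x1 + x2 | x1 + x2, data = mydata, dist = "negbin")
summary(zinb)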
Is this usual in fitting overdispersed, heavy-tailed count data? Are there other ways to improve the fitting? Any suggestions would be appreciated. Thank you!
P. S.
I also tried a logistic regression to predict zero/non-zero only, but none of the fitted models performs better than simply guessing zero for every case.
I suppose you made a histogram of the fitted values. That will only reflect the fitted means (possibly scaled by the zero part of the model, depending on which model you use). It is not supposed to recreate the observed distribution, because how spread out your data can be is captured by the dispersion parameter.
We can use an example from the pscl package:
library(pscl)
data("bioChemists")
# Hurdle model: negative binomial for the counts, binomial for the zero hurdle
fit <- hurdle(art ~ ., data = bioChemists, dist = "negbin", zero.dist = "binomial")

# Compare the observed counts with the fitted values
par(mfrow = c(1, 2))
hist(fit$y, main = "Observed")
hist(fit$fitted.values, main = "Fitted")
As mentioned above, in this hurdle model the fitted values you see are the count-model means multiplied by the predicted zero-hurdle ratio (see more here):
head(fit$fitted.values)
1 2 3 4 5 6
1.9642025 1.2887343 1.3033753 1.3995826 2.4560884 0.8783207
head(predict(fit,type="zero")*predict(fit,type="count"))
1 2 3 4 5 6
1.9642025 1.2887343 1.3033753 1.3995826 2.4560884 0.8783207
To simulate data from the fitted model, we extract the parameters:
Theta  <- fit$theta                          # negative binomial dispersion parameter
Means  <- predict(fit, type = "count")       # count-model means
Zero_p <- predict(fit, type = "prob")[, 1]   # predicted probability of a zero count
Then write a function to simulate the counts:
simulateCounts <- function(mu, theta, zero_p) {
  N <- length(mu)
  x <- rnbinom(N, mu = mu, size = theta)  # draw negative binomial counts
  x[runif(N) < zero_p] <- 0               # set counts to zero with probability zero_p
  x
}
Then run this simulation a number of times to get a sense of the spread of values:
set.seed(100)
simulated <- replicate(10, simulateCounts(Means, Theta, Zero_p))
simulated <- as.vector(simulated)
par(mfrow = c(1, 2))
hist(bioChemists$art, main = "Observed")
hist(simulated, main = "Simulated")
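As a quick check, you can also compare the fraction of zeros in the observed data with the fraction in the simulated data:
# Proportion of zeros: observed vs. simulated from the fitted model
mean(bioChemists$art == 0)
mean(simulated == 0)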
I have panel data with income for individuals over years, and I am interested in the income trends of individuals, i.e. individual coefficients for income over the years, and in the residuals for each individual in each year (the unexpected changes in income according to my model). However, I have many observations with missing income for at least one year, so with a linear regression I lose the majority of my observations. The data structure is like this:
caseid<-c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4)
years<-c(1998,2000,2002,2004,2006,2008,1998,2000,2002,2004,2006,2008,
1998,2000,2002,2004,2006,2008,1998,2000,2002,2004,2006,2008)
income<-c(1100,NA,NA,NA,NA,1300,1500,1900,2000,NA,2200,NA,
NA,NA,NA,NA,NA,NA, 2300,2500,2000,1800,NA, 1900)
df<-data.frame(caseid, years, income)
I decided to use a random effects model, which I think would still predict income for the missing years via a maximum likelihood approach. However, since the Hausman test gives a significant result, I decided to use a fixed effects model instead. I ran the code below, using the plm package:
inc.fe<-plm(income~years, data=df, model="within", effect="individual")
However, I get a coefficient only for years, not for individuals, and I cannot get the residuals.
To give an idea, the equivalent code in Stata would be something like
xtset caseid
xtreg income year, fe
predict resid, resid
Then I tried the pvcm function from the same package, which fits variable-coefficients models:
inc.wi<-pvcm(Income~Year, data=ldf, model="within", effect="individual")
However, I get the following error message:
"Error in FUN(X[[i]], ...) : insufficient number of observations".
How can I get individual coefficients and residuals with pvcm by resolving this error or by using some other function?
My original long form data has 202976 observations and 15 years.
Does the fixef function from package plm give you what you are looking for?
Continuing your example:
fixef(inc.fe)
Residuals are extracted by:
residuals(inc.fe)
You have a random effects model with random slopes and intercepts, also known as a random coefficients regression model. The missingness is the tricky part; I'm guessing you'll have to write custom code to handle it, once you decide how you want to treat it.
But you haven't clearly/properly specified your model (at least in your question) as far as I can tell. Let's define some terms:
Let $Y_{it}$ = income for individual $i$ ($i = 1, \dots, N$) in year $t$ ($t = 1, \dots, T$). As I read your question, you have not specified which of the two models below you want:
M1: random intercepts, global slope, random slopes
$Y_{it} \sim N(\mu_i + \beta t + \gamma_i t, \; \sigma^2)$
$\mu_i \sim N(\phi_0, \tau_0^2)$
$\gamma_i \sim N(\phi_1, \tau_1^2)$
M2: random intercepts, random slopes
$Y_{it} \sim N(\mu_i + \gamma_i t, \; \sigma^2)$
$\mu_i \sim N(\phi_0, \tau_0^2)$
$\gamma_i \sim N(\phi_1, \tau_1^2)$
Also, your example data is nonsensical (see below): as you can see, you don't have enough observations to estimate all the parameters. I'm not familiar with library(plm), but the models above (without missingness) can be estimated easily in lme4. Without a realistic example dataset, I won't provide a full worked example.
R> table(df$caseid, is.na(df$income))
  FALSE TRUE
1     2    4
2     4    2
3     0    6
4     5    1
Given that you do have missingness, you should be able to produce estimates for either hierarchical model via the typical methods, such as EM. But I do think you'll have to write the code to do the estimation yourself.
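For what it's worth, a random-coefficients model of this kind is written in lme4 roughly as follows (syntax only, using the column names from your example; lmer centres the random effects at zero, so the means $\phi_0$ and $\phi_1$ appear as the fixed intercept and slope, and rows with missing income are simply dropped rather than modelled):
library(lme4)

# Random intercepts and random slopes on years for each individual.
# The toy df above has too few observations to fit this sensibly,
# so treat the call as a template for the full data.
m <- lmer(income ~ years + (1 + years | caseid), data = df)
summary(m)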