I was checking the FixedEffectModels.jl package and realized that its solution method does not include an intercept in the regression model. So for a one-regressor model it fits y = x + e instead of y = a + x + e. I included a constant term in the formula, but the reported coefficient for the constant is zero and the other statistics (Std. Error, t value, etc.) are NaN.
Does FixedEffectModels.jl automatically add a constant term to the formula? If not, how can I do that?
This isn't really a Julia or FixedEffectModels.jl question I'd say - I think you have a misunderstanding as to what a fixed effect model does.
See e.g. this answer here: https://stats.stackexchange.com/a/435865/149657 on Cross Validated.
In short, you are including a constant for each individual in your panel; you can think of these as one dummy variable per individual, with each dummy taking the value one for observations of the relevant individual and zero otherwise. If you were to add all the individual dummy variables up, they would give a column of all ones - exactly collinear with the intercept. The intercept therefore can't be identified, leading to exactly the issues that you're seeing.
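As a small illustration of that collinearity (using plain R and lm() with made-up data rather than FixedEffectModels.jl itself): the individual dummies sum to a column of ones, so an intercept and the full set of dummies cannot both be estimated.

set.seed(1)
d <- data.frame(id = factor(rep(1:3, each = 5)), x = rnorm(15))
d$y <- 1 + 0.5 * d$x + rnorm(15)

D <- model.matrix(~ id - 1, data = d)   # one 0/1 dummy column per individual
all(rowSums(D) == 1)                    # TRUE: the dummies add up to the intercept column

# With an intercept plus all three dummies, lm() reports NA for one coefficient
coef(lm(y ~ x + id1 + id2 + id3, data = cbind(d, as.data.frame(D))))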
Have a look around on Cross Validated including the further links in the answer I linked above, and maybe read Chapters 13 and 14 of Wooldridge's Introductory Econometrics.
Related
I fitted an rpart model with leave-one-out cross-validation on my data using the caret library in R. Everything is fine, but I want to understand the difference between the model's variable importance and the decision tree plot.
Calling the variable importance with the function varImp() shows nine variables. Plotting the decision tree using functions such as fancyRpartPlot() or rpart.plot() shows a decision tree that uses only two variables to classify all subjects.
How can that be? Why does the decision tree plot not show the same nine variables that appear in the variable importance table?
Thank you.
Similar to rpart(), caret has a useful property: it deals with surrogate variables, i.e. variables that are not chosen for splits but that came close to winning the competition.
Let me be more clear. Say at a given split the algorithm decided to split on x1. Suppose also there is another variable, say x2, which would be almost as good as x1 for splitting at that stage. We call x2 a surrogate, and we credit it with variable importance just as we do for x1.
This is why variables that are never actually used for splitting can still appear in the importance ranking. You may even find that such variables rank as more important than others that are actually used!
The rationale for this is explained in the documentation for rpart(): suppose we have two identical covariates, say x3 and x4. Then rpart() is likely to split on one of them only, e.g., x3. How can we say that x4 is not important?
To conclude, variable importance considers the increase in fit for both primary variables (i.e., the ones actually chosen for splitting) and surrogate variables. So the importance for x1 considers both the splits for which x1 is chosen as the splitting variable and the splits for which another variable is chosen but x1 is a close competitor.
Hope this clarifies your doubts. For more details, see here. Just a quick quotation:
The following methods for estimating the contribution of each variable to the model are available [speaking of how variable importance is computed]:
[...]
- Recursive Partitioning: The reduction in the loss function (e.g. mean squared error) attributed to each variable at each split is tabulated and the sum is returned. Also, since there may be candidate variables that are important but are not used in a split, the top competing variables are also tabulated at each split. This can be turned off using the maxcompete argument in rpart.control.
I am not that familiar with caret, but from this quote it appears that the package actually uses rpart() to grow trees, thus inheriting this behaviour for surrogate variables.
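If you want to see this effect in isolation, here is a minimal sketch with made-up data (plain rpart(), no caret): a predictor that is nearly a copy of another one never shows up in the printed tree, yet still receives importance via surrogate splits.

library(rpart)

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)        # almost identical to x1
y  <- factor(x1 + rnorm(n) > 0)
d  <- data.frame(y, x1, x2)

fit <- rpart(y ~ x1 + x2, data = d)
fit$variable.importance    # both x1 and x2 receive credit
fit                        # but the printed splits typically use only one of them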
I want to test the effect of 4 numeric predictors on the DV in a linear model; however, one of these predictors has no variability (i.e. a ceiling effect). Should I remove this predictor or use a specific linear model?
This might be a question better suited to Cross Validated or Data Science Stack Exchange, since it isn't primarily about coding.
Nevertheless I can point you in the right direction.
Firstly, it depends on the exact nature of your variable. You say that it has no variability/variance, which would mean it is constant. Assuming your DV is not constant, there is then clearly no correlation or effect, and the variable can be removed because it has no impact (e.g. if all my test subjects are male, SEX is a useless test variable and will predict nothing).
However, if you simply mean it has a small variance or is bounded by some lower or upper threshold (e.g. it never rises above X), then that is different.
The best answer to this is always to fit multiple models and simply compare them.
If a model including all variables is better than one without the specific variable, it is of course worthwhile to keep it. Use R² and the significance of the predictors to track that, and keep in mind that your variable could be useful as a suppressor variable even if it does not predict the DV by itself.
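For example, a minimal sketch of that comparison (dat, dv and x1 to x4 are placeholders for your own data and variable names):

full    <- lm(dv ~ x1 + x2 + x3 + x4, data = dat)
reduced <- lm(dv ~ x1 + x2 + x3,      data = dat)

summary(full)$r.squared
summary(reduced)$r.squared
anova(reduced, full)    # F-test: does adding x4 improve the fit?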
Finally, about model selection. You already assume that the relationship between the predictors and the DV is linear, but maybe that isn't true, or at least not for all variables. You could test that assumption and work from there. But switching between linear models is unlikely to affect the outcome, and no linear model is better than another at detecting a relationship between an almost constant predictor and a DV.
So I am running a regression in which my hypothesis states that the dependent variable influences the effect of the independent variable on the dependent variable, if that makes any sense. In essence, as the dependent variable increases, I expect the beta of the independent variable to decrease.
I wanted to solve this by using an interaction term.
y = b0 + b1*x1 + b2*x2 + b3*x2*y.
Does this cause any problems? Is it statistically viable? I cannot find any information on this, and I wasn't sure whether I am supposed to do this, since my b2 now changes from significantly positive to significantly negative, which seems strange. b3 is significantly positive, by the way.
Just for some extra clarification: my dataset consists of the number of mobile application downloads (DV), the average rating (IV) and the number of ratings (IV). Now the hypothesis is that less popular applications require more information, because popularity is an indication of quality to the consumer. That is why I would like to include an interaction between the popularity and rating variables. To me, the best measure of popularity is of course the number of downloads.
My code for the regression, run in R, is as follows:
an_5 <- lm(new_Install ~ Rating + Reviews + Reviews:new_Install + Rating:new_Install, data=Data_1)
summary(an_5)
I expected all four coefficients to be significant, with the last two negative and the first two positive; however, the opposite is the case, which seems strange. I will gladly provide more information.
This is not a statistically viable approach. You are basically using your dependent variable to distort the effect of the independent variable, effectively leaking information into the predictors that is never supposed to be there.
From your statement "In essence, as the dependent variable increases, I expect the beta of the independent variable to decrease", I conclude that you would like the variable x2 to have a lower marginal influence on the dependent variable as the dependent variable becomes larger. You can achieve something conceptually similar by transforming the independent variables appropriately. For example, assuming x2 has a positive coefficient, a log transformation (entering + log_reviews in the formula, where log_reviews = log(Reviews)) would mimic a variable with a diminishing marginal positive effect.
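A sketch of what that could look like with the variable names from your Data_1 (log1p is used here only as a guess, to guard against apps with zero reviews):

Data_1$log_reviews <- log1p(Data_1$Reviews)   # log(1 + Reviews)
an_log <- lm(new_Install ~ Rating + log_reviews, data = Data_1)
summary(an_log)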
My goal is to estimate two parameters of a model (see CE_hat).
I use 7 observations to fit the two parameters (w, a), so overfitting occurs in a few cases. One idea would be to restrict the influence of each observation so that outliers do not "hijack" the parameter estimates.
The method that was previously suggested to me was nlrob. The problem with that, however, is that in extreme cases such as the example below it returns "Missing value or an infinity produced when evaluating the model".
To avoid this I used nlsLM, which does reach convergence, but at the cost of returning outlandish estimates.
Any ideas as to how I can use robust fitting with this example?
I include a reproducible example below. The observables here are CE, H and L. These three elements are fed into a function (CE_hat) in order to estimate "a" and "w". Values close to 1 for "a" and close to 0.5 for "w" are generally considered more reasonable. As you - hopefully - can see, when all observations are included, a = 91 while w is next to 0. However, if we exclude the 4th (or 7th) observation (for CE, H and L), we get much more sensible estimates. Ideally, I would like to achieve the same result without excluding these observations; one idea would be to restrict their influence. I understand that it might not be clear why these observations constitute some sort of "outliers". It's hard to say something about that without saying too much, I'm afraid, but I am happy to go into more detail about the model should a question arise.
library("minpack.lm")
options("scipen"=50)
CE<-c(3.34375,6.6875,7.21875,13.375,14.03125,14.6875,12.03125)
H<-c(4,8,12,16,16,16,16)
L<-c(0,0,0,0,4,8,12)
CE_hat<-function(w,H,a,L){(w*(H^a-L^a)+L^a)^(1/a)}
aw<-nlsLM(CE~CE_hat(w,H,a,L),
start=list(w=0.5,a=1),
control = nls.lm.control(nprint=1,maxiter=100))
summary(aw)$parameters
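For comparison, refitting with the 4th observation dropped reproduces the more sensible estimates described above:

CE_sub <- CE[-4]; H_sub <- H[-4]; L_sub <- L[-4]
aw_sub <- nlsLM(CE_sub ~ CE_hat(w, H_sub, a, L_sub),
                start   = list(w = 0.5, a = 1),
                control = nls.lm.control(maxiter = 100))
summary(aw_sub)$parameters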
My question is quite simple, but I've been unable to find a clear answer in either the R manuals or online. Is there a good way to verify what the reference level of the response variable is when doing a logistic regression with glmer?
I am getting results that consistently run exactly opposite to theory, and I think my response variable must be reversed from my intention, but I have been unable to verify this.
My response variable is coded in 0's and 1's.
Thanks!
You could simulate some data where you know the true effects ... ?simulate.merMod makes this relatively easy. In any case, the effects are interpreted in terms of their effect on the log-odds of a response of 1: e.g., a slope of 0.5 implies that a 1-unit increase in the predictor variable increases the log-odds of observing a 1 rather than a 0 by 0.5.
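A rough sketch of that simulation check, assuming lme4 and a simple random-intercept structure (the formula, parameter values and data here are invented; see ?simulate.merMod for the newparams interface):

library(lme4)

set.seed(101)
dd <- data.frame(x = rnorm(300), g = factor(rep(1:30, each = 10)))

# Simulate a binary response with a known slope of +0.5 on the log-odds of a 1
dd$y <- simulate(~ x + (1 | g),
                 newdata   = dd,
                 family    = binomial,
                 newparams = list(beta = c(0, 0.5), theta = 1))[[1]]

fit <- glmer(y ~ x + (1 | g), data = dd, family = binomial)
fixef(fit)    # the estimated slope should come out near +0.5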
For questions of this sort, glmer inherits its framework from glm. In particular, ?family states:
For the ‘binomial’ and ‘quasibinomial’ families the response can be specified in one of three ways:
1. As a factor: ‘success’ is interpreted as the factor not having the first level (and hence usually of having the second level).
2. As a numerical vector with values between ‘0’ and ‘1’, interpreted as the proportion of successful cases (with the total number of cases given by the ‘weights’).
3. As a two-column integer matrix: the first column gives the number of successes and the second the number of failures.
Your data are a (common) special case of #2: the "proportion of successes" is either 0% or 100% for each case, because there is only one case per observation; the weights vector is a vector of all ones by default.
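As a quick check of that convention with plain glm() on made-up data: a 0/1 numeric response and a factor whose second level is the "1" category give identical fits, so the coefficients describe the log-odds of a 1 (i.e. of the second factor level).

set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(0.5 * x))

coef(glm(y ~ x, family = binomial))                              # numeric 0/1 response
coef(glm(factor(y, levels = c(0, 1)) ~ x, family = binomial))    # factor: second level is "success"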