regression - bounded dependent variable - model choice - r

I am working on a problem where I want to see if a measure (test) is a good predictor of the outcome variable (performance). Performance is a bounded variable between 0 and 100. I am only thinking about the methodology for now and not working with the data yet.
I am aware that there are different models and methods that deal with bounded dependent variables, but from my understanding these are useful if one is interested in predictions?
I am interested in how much variance of the dependent variable (performance) is explained by my measure (test). I am not interested in predicting specific outcomes.
Is it OK to just use normal regression?
Do I need to account for the bounded dependent variable somehow?

You can rescale your dependent variable to the [0, 1] interval and run a logistic-type regression; the logit link keeps every fitted value inside that range.
If you can, use fractional logit models, which are typically used to model continuous outputs in the [0, 1] interval.
Alternatively, if you are into machine learning, you can implement a neural network regressor with one output node and a sigmoid activation function.
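A minimal R sketch of the fractional logit idea, assuming a data frame dat with columns performance (0-100) and test (both names are placeholders); a quasi-binomial GLM gives the fractional logit, and an ordinary lm() on the same rescaled outcome gives the R² you are after for comparison:

# Hedged sketch: fractional logit via a quasi-binomial GLM in base R.
# Assumes a data frame `dat` with columns `performance` (0-100) and `test`.
dat$perf01 <- dat$performance / 100      # rescale the DV to [0, 1]

frac_logit <- glm(perf01 ~ test,
                  family = quasibinomial(link = "logit"),
                  data = dat)
summary(frac_logit)

# For comparison, ordinary linear regression on the same rescaled outcome:
ols <- lm(perf01 ~ test, data = dat)
summary(ols)$r.squared                   # proportion of variance explained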

Related

How to deal with predictors with no variability (i.e. ceiling/floor effect) in linear models?

I want to test the effect of 4 numeric predictors on the DV in a linear model, however one of these predictors has no variability (i.e. ceiling effect). Should I remove this or use a specific linear model?
This might be a question better suited to Cross Validated or the stats/data science sites, since it isn't primarily about coding.
Nevertheless I can point you in the right direction.
Firstly, it depends on the exact nature of your variable. You say it has no variability/variance, which would mean it is constant. If the predictor really is constant, there is clearly no correlation or effect, and the variable can be removed because it has no impact (e.g. if all my test subjects are male, SEX is a useless predictor and will predict nothing).
However, if you simply mean it has a small variance or is bound by some lower or upper threshold (e.g. it never rises above X), then that is different.
The best answer to this is always to calculate multiple models and simply compare.
If a model including all variables is better than one without the specific variable, it is of course valuable to keep it. Use R² and the significance of the predictors to track that, and keep in mind that your variable could be useful as a suppressor variable even if it does not predict the DV by itself.
Finally, about model selection: you already assume that the relationship between the predictors and the DV is linear, but maybe that isn't true, or at least not for all variables. You could test that assumption and work from there. But switching between linear models is unlikely to affect the outcome, and no linear model is better than another at detecting a relationship between an almost constant predictor and a DV.
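As a rough R sketch of the compare-multiple-models advice above, assuming a data frame dat with DV y and predictors x1-x4, where x4 is the near-constant one (all names are placeholders):

# Fit the model with and without the questionable predictor and compare.
full    <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
reduced <- lm(y ~ x1 + x2 + x3, data = dat)

summary(full)$r.squared                 # R² with the questionable predictor
summary(reduced)$r.squared              # R² without it
anova(reduced, full)                    # F-test: does adding x4 improve the fit?
AIC(reduced, full)                      # information criteria as an alternative check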

Calculating importance of independent variable in explaining variance of dependent variable in linear regression

I am working on a Media Mix Modeling (MMM) project where I have to build a linear model for predicting traffic, with various spend categories as input variables. I have got the linear model equation, which is:
Traffic = 1918 + 0.08*TV_Spend + 0.01*Print_Spend + 0.05*Display_spend
I want to calculate two things which I don't know how to do:
How much is each variable contributing to explaining the variance of traffic?
What percentage of the total traffic is due to each independent variable?
I think this question has already been answered several times in several places (a duplicate?);
For example see:
https://stats.stackexchange.com/questions/79399/calculate-variance-explained-by-each-predictor-in-multiple-regression-using-r
You also may want to compute the standardized regression coefficients (first standardize the variables and then rerun the regression analysis) to find out which independent variable has the largest effect on the dependent variable (if significant, I would like to add). I think the interpretation of standardized regression weights is more intuitive than considering the explained variance.
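A quick R sketch of that idea, using the variable names from the equation in the question (the data frame mmm is a placeholder):

mmm_std <- as.data.frame(scale(mmm))    # z-standardize the DV and all predictors
fit_std <- lm(Traffic ~ TV_Spend + Print_Spend + Display_spend, data = mmm_std)
summary(fit_std)                        # coefficients are now comparable across spends
# For a per-predictor variance decomposition (LMG), the relaimpo package offers
# relaimpo::calc.relimp(fit, type = "lmg") on the unstandardized model.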
Cheers,
Peter

R: Which variables to include in model?

I'm fairly new to R and am currently trying to find the best model to predict my dependent variable from a number of predictor variables. I have 20 predictor variables and I want to see which ones I should include in my model and which ones I should exclude.
I am currently just running models with different predictor variables in each and comparing them to see which one has the lowest AIC, but this is taking a really long time. Is there an easier way to do this?
Thank you in advance.
This is more of a theoretical question actually...
In principle, if all of the predictors are actually exogenous to the model, they can all be included together and assuming you have enough data (N >> 20) and they are not too similar (which could give rise to multi-collinearity), that should help prediction. In practice, you need to think about whether each of (or any of) your predictors are actually exogenous to the model (that is, independent of the error term in the model). If they are not, then they will impart a bias on the estimates. (Also, omitting explanatory variables that are actually necessary imparts a bias.)
If predictive accuracy (even spurious in-sample accuracy) is the goal, then techniques like LASSO (as mentioned in the comments) could also help.
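A minimal glmnet sketch of the LASSO route mentioned above, assuming a data frame dat whose first column is the DV y and whose remaining 20 columns are the candidate predictors (names are placeholders):

library(glmnet)

x <- as.matrix(dat[, -1])               # predictor matrix
y <- dat$y                              # dependent variable

cv_fit <- cv.glmnet(x, y, alpha = 1)    # alpha = 1 is the LASSO penalty
coef(cv_fit, s = "lambda.min")          # predictors shrunk to exactly zero are dropped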

PLM in R with time invariant variable

I am trying to analyze panel data which includes observations for each US state collected across 45 years.
I have two predictor variables that vary across time (A,B) and one that does not vary (C). I am especially interested in knowing the effect of C on the dependent variable Y, while controlling for A and B, and for the differences across states and time.
This is the model that I have, using plm package in R.
random <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"), model = "random", data = data)
My reasoning is that with a time-invariant variable I should be using a random rather than a fixed effects model.
My question is: Is my model and thinking correct?
Thank you for your help in advance.
You base your decision between fixed and random effects solely on computational grounds. Please look at the specific assumptions associated with the different models. The Hausman test is often used to discriminate between the fixed and the random effects model, but it should not be taken as the definitive answer (any good textbook will have further details).
Pooled OLS could also yield a good model, if it applies. Computationally, pooled OLS will also give you estimates for time-invariant variables.
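A rough sketch of how that comparison could look with plm, reusing the model from the question (note that the time-invariant C drops out of the fixed-effects "within" specification):

library(plm)

fixed  <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"), model = "within",  data = data)
random <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"), model = "random",  data = data)
pooled <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"), model = "pooling", data = data)

phtest(fixed, random)                   # Hausman test: H0 favours the random effects model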

Mixed Logit fitted probabilities in RSGHB

My question has to do with using the RSGHB package for predicting choice probabilities per alternative by applying mixed logit models (variation across respondents) with correlated coefficients.
I understand that the choice probabilities are simulated at the individual level, and that to get preference shares an average of the individual shares would do. All the sources I have found treat each prediction as a separate simulation, which makes the whole process cumbersome if many predictions are needed.
Since one can save the respondent-specific coefficient draws, wouldn't it be faster to simply apply the logit transform to each (vector of) coefficient draws? Once this is done, new or existing alternatives could be calculated faster than rerunning the whole simulation process for each required alternative. For the time being, using a fitted() approach will not help me understand how prediction actually works.
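A rough sketch of that idea in plain R, with made-up object names (a matrix draws of respondent-level coefficient draws, one row per respondent, and a list X_alts of attribute vectors, one per alternative):

predict_shares <- function(draws, X_alts) {
  utilities <- sapply(X_alts, function(x) draws %*% x)  # respondents x alternatives
  exp_u <- exp(utilities)
  probs <- exp_u / rowSums(exp_u)       # logit transform per respondent draw
  colMeans(probs)                       # average over respondents = preference shares
}
# Example (hypothetical objects):
# shares <- predict_shares(draws = my_draws, X_alts = list(alt1_attrs, alt2_attrs))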
