I need help with creating a model in R and choosing the right analysis.
I work for a program that trains unemployed people and eventually finds them jobs. However, we have found that many participants quit their jobs or are fired (both referred to as a negative termination) soon after beginning. I have been assigned to create a model that helps predict what factors cause these outcomes.
The model will have to be fairly complex due to the number and variety of variables. Ideally, it will take into account:
DVs
- termination type (categorical)
- time employed (interval)
IVs
- barriers (probably 3 binary categorical variables)
- number of trainings completed (interval)
- percent assigned trainings completed (interval)
- age (interval)
- gender (binary categorical)
- race/ethnicity (categorical)
I have researched various methods of analysis (particularly regression), but have yet to find one that can handle all the bases I need to cover in terms of variable diversity. I am working in R, so I would appreciate any responses to mention relevant packages or code.
Thanks so much!
Is it possible to use both cluster standard errors and multilevel models together and how does one implement this in R?
In my setup I am running a conjoint experiment in 26 countries with 2000 participants per country. As in any conjoint experiment, each participant is shown two vignettes and asked to choose between/rate them. The same participant is then shown two fresh vignettes and asked to repeat the task, so each participant performs two comparisons. The hierarchy is thus comparisons nested within individuals nested within countries.
I am currently running a multilevel model with each comparison at level 1 and country as the level 2 unit. Obviously comparisons within individuals are likely to be correlated, so I'd like to cluster standard errors at the individual level as well. It seems overkill to add another level in the MLM for this, since my clusters are extremely small (n = 2), and it makes more sense to do my analysis at the individual level (not to mention that it unnecessarily complicates the model: with 2000 individuals x 26 countries the parameter space becomes huge). Is this possible? If so, how does one do this in R together with a multilevel model setup?
The cluster size of 2 is not an issue, and I don't see any issue with the parameter space either. If you fit random intercepts for participants and countries, these are estimated as latent, normally distributed variables. A model such as:
lmer(outcome ~ fixed_effects + (1 | country/participant), data = dat)
will handle the dependencies within clusters (at the participant level and the country level), so there will be no need for cluster standard errors.
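A minimal sketch of this specification with lme4, using simulated data (all variable names here are placeholders, not from the original question):

```r
# Nested random intercepts with lme4: participant nested within country.
# Data are simulated; 'treat' and 'outcome' are placeholder names.
library(lme4)

set.seed(1)
dat <- data.frame(
  country     = rep(1:5, each = 40),
  participant = rep(1:100, each = 2),  # two comparisons per participant
  treat       = rnorm(200)
)
dat$outcome <- 0.5 * dat$treat +
  rnorm(5, sd = 0.5)[dat$country] +      # country-level effect
  rnorm(100, sd = 0.5)[dat$participant] + # participant-level effect
  rnorm(200)

# (1 | country/participant) expands to a country intercept plus a
# participant-within-country intercept, absorbing both dependencies
fit <- lmer(outcome ~ treat + (1 | country/participant), data = dat)
summary(fit)
```

With the dependence modeled by the random intercepts, the model-based standard errors already account for the clustering at both levels.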
Dataset Description: I use a dataset with neuropsychological (np) tests from several subjects. Every subject has more than one test in his/her follow-up, i.e., one test per year. I study the cognitive decline in these subjects. The information that I have is: individual number (identity number), education (years), gender (M/F as factor), age (years), and time from baseline (years after the first np test).
AIM: My aim is to measure the rate of change in the np tests, i.e., the cognitive decline per year for each subject. To do that I use Linear Mixed Effects Models (LMEM), taking the above parameters into account, and I compute the slope for each subject.
Question: When I run the candidate models (combining different parameters each time), I also check them for singularity, and the result in almost all cases is TRUE. So my models are singular! If I wanted to use these models for prediction this would be a problem, as it suggests the model overfits the data. But since for now I just want to find the slope for each individual, I think this is not a problem; in fact, I think it may even be an advantage, as singularity then offers a more precise calculation of the subjects' slopes. Do you think this reasoning is correct?
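For reference, the singularity check and the per-subject slopes can both be obtained directly in lme4; a sketch with simulated data (the names np_score, time, and id are placeholders):

```r
# Fitting a random-slope model and checking for a singular fit.
# Simulated data with no true slope variance, so a singular fit is likely.
library(lme4)

set.seed(2)
d <- data.frame(id = rep(1:30, each = 5), time = rep(0:4, 30))
d$np_score <- 25 - 0.3 * d$time +
  rnorm(30, sd = 0.2)[d$id] +  # subject-level intercept shifts
  rnorm(150)

m <- lmer(np_score ~ time + (1 + time | id), data = d)
isSingular(m)    # TRUE when the random-effects covariance is degenerate
coef(m)$id$time  # per-subject slopes (fixed effect + random deviation)
```

Note that when the fit is singular the random-slope variance has collapsed toward zero, so the per-subject slopes are shrunk heavily toward the common fixed-effect slope.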
I currently have a data set of about 300 people on behaviors (answers: yes/no/NA) + variables on age, place of residence (city/country), income, etc.
In principle, I would like to find out the item difficulties for the overall sample (which R package is best for this, and how does it work? I don't fully understand some of the code examples),
and in the next step examine different groups (young/old, city/country, income via median split) with regard to their possibly significantly different item difficulties.
How do I do that? Is this possible with Wald tests, Rasch trees, or raschmix? Or do I need latent groups, which are formed data-driven?
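One possible route, sketched with the eRm package on simulated responses (here `resp` stands in for your 300-person 0/1 item matrix, and the number of items is made up):

```r
# Item difficulties via a conditional-ML Rasch model, then a Wald test
# comparing item parameters between two subgroups.
library(eRm)

set.seed(3)
resp <- matrix(rbinom(300 * 10, 1, 0.5), nrow = 300, ncol = 10)

rm_fit <- RM(resp)  # Rasch model for dichotomous items
rm_fit$betapar      # item easiness parameters (difficulty = -easiness)

# Wald test of item-parameter equality across groups; splitcr = "median"
# splits at the median raw score, but it can also be an external grouping
# vector (e.g., young vs. old, or a median split on income)
Waldtest(rm_fit, splitcr = "median")
```

For data-driven rather than prespecified groups, Rasch trees (psychotree) and Rasch mixture models (psychomix, i.e., raschmix) are the usual options; with prespecified groups like yours, Wald or likelihood-ratio tests with an external split criterion are enough.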
I suppose this is less of a coding question but more of a plea for advice from experienced data scientists:
Suppose I have a classification problem at hand with 15 features (variables), all of which are binary (1/0). Naturally, the output is also 1/0.
I'm looking for the best / most widely used method of presenting and visualizing the output from the test data.
In case the context is helpful: the model is supposed to predict whether loan applicants will default on their payments after some period of time, based on features such as whether they're employed or unemployed, whether they had more than x days of delinquency, etc.
The model is randomForest.
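Two common starting points for summarizing test-set output from a random forest classifier are a confusion matrix and a variable-importance plot; a sketch on simulated data (column names and the `default` outcome are illustrative, not from the question):

```r
# Confusion matrix and variable importance for a randomForest classifier.
library(randomForest)

set.seed(4)
train <- data.frame(replicate(15, rbinom(500, 1, 0.5)))
train$default <- factor(rbinom(500, 1, 0.3))
test  <- data.frame(replicate(15, rbinom(200, 1, 0.5)))
test$default <- factor(rbinom(200, 1, 0.3))

rf   <- randomForest(default ~ ., data = train)
pred <- predict(rf, newdata = test)

table(predicted = pred, actual = test$default)  # confusion matrix
varImpPlot(rf)  # which features contribute most to the splits
```

If class probabilities rather than hard labels are of interest, `predict(rf, newdata = test, type = "prob")` feeds naturally into an ROC curve (e.g., via the pROC package).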
I have a few hundred thousand measurements where the dependent variable is a probability, and would like to use logistic regression. However, the covariates I have are all categorical, and worse, are all nested. By this I mean that if a certain measurement has "city - Phoenix" then obviously it is certain to have "state - Arizona" and "country - U.S." I have four such factors - the most granular has some 20k levels, but if need be I could do without that one, I think. I also have a few non-nested categorical covariates (only four or so, with maybe three different levels each).
What I am most interested in is prediction - given a new observation in some city, I would like to know the relevant probability/dependent variable. I am not interested as much in the related inferential machinery - standard deviations, etc. - at least as of now. I am hoping I can afford to be sloppy. However, I would love to have that information unless it requires methods that are more computationally expensive.
Does anyone have any advice on how to attack this? I have looked into mixed effects, but am not sure it is what I am looking for.
I think this is more of a model design question than a question about R specifically; as such, I'd like to address the context of the question first and then the appropriate R packages.
If your dependent variable is a probability, e.g., $y\in[0,1]$, a logistic regression is not appropriate for the data, particularly given that you are interested in predicting probabilities out of sample. A logistic regression models the contribution of the independent variables to the probability that your dependent variable flips from a zero to a one, and since your variable is continuous and truncated you need a different specification.
I think your latter intuition about mixed effects is a good one. Since your observations are nested, i.e., US <-> AZ <-> Phoenix, a multi-level model, or in this case a hierarchical linear model, may be the best specification for your data. The best R packages for this type of modeling are multilevel and nlme, and there is an excellent introduction to both multi-level models in R and nlme available here. You may be particularly interested in the discussion of data manipulation for multi-level modeling, which begins on page 26.
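A minimal sketch of the nested specification with nlme, on simulated data (the geography labels and variable names are placeholders, and the sketch treats the outcome as a plain continuous response):

```r
# Random intercepts for nested geography: city within state within country.
library(nlme)

set.seed(5)
dat <- data.frame(
  country = rep(c("US", "MX"), each = 100),
  state   = rep(c("AZ", "CA", "SON", "BC"), each = 50),
  city    = rep(letters[1:20], each = 10),
  x       = rnorm(200)
)
# outcome is a probability in (0, 1), built from a city-level shift
dat$y <- plogis(0.3 * dat$x +
                rnorm(20, sd = 0.5)[as.integer(factor(dat$city))] +
                rnorm(200, sd = 0.5))

fit <- lme(y ~ x, random = ~ 1 | country/state/city, data = dat)
summary(fit)

# predictions for new observations borrow the estimated random effects
# at each level of the hierarchy (innermost level by default)
predict(fit, newdata = head(dat))
```

The nesting in the `random` formula mirrors the country -> state -> city structure, so a city's prediction combines the fixed effects with its country, state, and city intercepts.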
I would suggest looking into penalised regressions like the elastic net. The elastic net is used in text mining, where each column represents the presence or absence of a single word and there may be hundreds of thousands of variables, a problem analogous to yours. A good place to start in R is the glmnet package and its accompanying JSS paper: http://www.jstatsoft.org/v33/i01/.
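A sketch of that approach with glmnet, using a sparse model matrix so the high-cardinality factors stay tractable (the factor levels and sizes below are made up):

```r
# Elastic net on dummy-coded categorical covariates via a sparse matrix.
library(glmnet)
library(Matrix)

set.seed(6)
df <- data.frame(
  city = factor(sample(paste0("city", 1:50), 1000, replace = TRUE)),
  x1   = factor(sample(c("a", "b", "c"), 1000, replace = TRUE))
)
# sparse dummy coding; drop the intercept column (glmnet adds its own)
X <- sparse.model.matrix(~ city + x1, data = df)[, -1]
y <- runif(1000)  # probability-valued outcome

# alpha between 0 (ridge) and 1 (lasso) gives the elastic net;
# cv.glmnet chooses the penalty strength lambda by cross-validation
cvfit <- cv.glmnet(X, y, alpha = 0.5)
predict(cvfit, newx = X[1:5, ], s = "lambda.min")
```

With a 20k-level factor the sparse matrix is essential; the penalty then shrinks the many city coefficients toward zero rather than estimating each one freely.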