I was working on a multiple regression model that predicts amount of insurance claims based on certain factors. One such (categorical) factor is the room type the person has access to as part of the insurance package (eg. VIP room). The problem is that a few room types have a high variability in claims which results in them being insignificant predictors (p value as high as 0.6 for those levels). My suggestion is to create two separate models, one with room type as a predictor and one without. If a person is part of one of the rooms with high variability then the model without room type as a predictor should be used otherwise the better fit model can be used (has a higher adjusted R^2).
My question is, is there something incorrect with this procedure?
Thank you.
I don't know how many possible types of rooms you have there, but it can be that some categories have a very low volume compared to the others. If that's the case, I'd rather try to combine types with similar characteristics as new categories. That may increase volume and make them significant.
It's hard to suggest things without seeing the data.
Related
Is it possible to use both cluster standard errors and multilevel models together and how does one implement this in R?
In my set up I am running a conjoint experiment in 26 countries with 2000 participants per country. Like any conjoint experiment each participant is shown two vignettes and asked to choose/rate each vignette. The same participants is then shown two fresh vignettes for comparison and asked to repeat the task. In this case each participant performs two comparisons. The hierarchy is thus comparisons nested within individuals nested within countries. I am currently running a multilevel model with each comparison at level 1 and country is the level 2 unit. Obviously comparisons within individuals are likely to be correlated so I'd like to cluster standard errors at the individual level as well. It seems overkill to add another level in the MLM for this since the size of my clusters are extremely small (n=2) and it makes more sense to do my analysis at the individual level (not to mention unnecessarily complicating the model since with 2000 individuals*26 countries the parameter space becomes crazy huge). Is this possible? If so how does one do this in R together with a multilevel model set up?
The cluster size of 2 is not an issue, and I don't see any issue with the parameter space either. If you fit random intercepts for participants, and countries, these are estimated as latent normally distributed variables. A model such as:
lmer(outomce ~ fixed effects + (1|country/participant)
This will handle the dependencies within clusters (at the participant level and the country level) so there will be no need to use cluster standard errors.
I currently have a data set of about 300 people on behaviors (answers: yes/no/NA) + variables on age, place of residence (city/country), income, etc.
In principle, I would like to find out the item difficulties for the overall sample (with which R-package is the best?-How does that work? Don't fully understand some codes :/)
and in the next step examine different groups (young, old, city/country, income (median split) with regard to their possibly significantly different item difficulties.
How do I do that? (is this possible with Wald tests, Rasch trees, or raschmix?) (do I need latent groups - which are grouped data-driven)?
I need help with creating a model in R/choosing the right analysis.
I work for a program that trains unemployed people and eventually finds them jobs. However, we have found that many participants quit their jobs or are fired (both referred to as a negative termination) soon after beginning. I have been assigned to create a model that helps predict what factors cause these outcomes.
The model will have to be fairly complex due to the number and variety of variables. Ideally, it will take into account:
DVs
- termination type (categorical)
- time employed (interval)
IVs
- barriers (probably 3 binary categorical variables)
- number of trainings completed (interval)
- percent assigned trainings completed (interval)
- age (interval)
- gender (binary categorical)
- race/ethnicity (categorical)
I have researched various methods of analysis (particularly regression), but have yet to find one that can handle all the bases I need to cover in terms of variable diversity. I am working in R, so I would appreciate any responses to mention relevant packages or code.
Thanks so much!
I am currently working with a small dataset of training values, no more than 20, and am getting large MSE. The input data vectors themselves consist of 16 parameters, many of which are binary variables. Across all the training values, a majority of the 16 parameters stay the same (but not all). The remaining input variables, across all the exemplars, vary a lot amongst one another. This is to say, two exemplars might appear to be the same except for two parameters in which they differ, one parameter being a binary variable, and another being a continuous variable, where the difference could be greater than a single standard deviation (for that variable's set of values).
My single output variable (as of now) can either be a continuous variable, OR depending on the true difficulty of reducing the error in my situation, I can make this a classification problem instead, with 12 different forms for classification.
I have long been researching different neural networks than my current implementation of a feed-forward MLP, as I have read into Stochastic NNs, Ladder NNs, and many forms of recurrent NNs. I am stuck with which one I should investigate, as I do not have time to try every NN available.
While my description may be vague, could anyone make a suggestion as to which network I should investigate to minimize my cost function (as of now, MSE) the most?
If my current setup must be rendered implacable because of the sheer difficulty involved with predicting correct output for such a small set of highly variant training values, which network would best work, should my dataset be expanded to the order of thousands of exemplars (at the cost of having a significantly more redundant, seemingly homogenous set of input values)?
Any help is most certainly appreciated.
20 samples is very small especially if you have 16 input variables. It will be hard to determine which one of those inputs is responsible for your output value. If you keep your network simple (fewer layers) you may be able to use as many samples as you would need for traditional regression.
I am working with a dataset of roughly 1.5 million observations. I am finding that running a regression tree (I am using the mob()* function from the party package) on more than a small subset of my data is taking extremely long (I can't run on a subset of more than 50k obs).
I can think of two main problems that are slowing down the calculation
The splits are being calculated at each step using the whole dataset. I would be happy with results that chose the variable to split on at each node based on a random subset of the data, as long as it continues to replenish the size of the sample at each subnode in the tree.
The operation is not being parallelized. It seems to me that as soon as the tree has made it's first split, it ought to be able to use two processors, so that by the time there are 16 splits each of the processors in my machine would be in use. In practice it seems like only one is getting used.
Does anyone have suggestions on either alternative tree implementations that work better for large datasets or for things I could change to make the calculation go faster**?
* I am using mob(), since I want to fit a linear regression at the bottom of each node, to split up the data based on their response to the treatment variable.
** One thing that seems to be slowing down the calculation a lot is that I have a factor variable with 16 types. Calculating which subset of the variable to split on seems to take much longer than other splits (since there are so many different ways to group them). This variable is one that we believe to be important, so I am reluctant to drop it altogether. Is there a recommended way to group the types into a smaller number of values before putting it into the tree model?
My response comes from a class I took that used these slides (see slide 20).
The statement there is that there is no easy way to deal with categorical predictors with a large number of categories. Also, I know that decision trees and random forests will automatically prefer to split on categorical predictors with a large number of categories.
A few recommended solutions:
Bin your categorical predictor into fewer bins (that are still meaningful to you).
Order the predictor according to means (slide 20). This is my Prof's recommendation. But what it would lead me to is using an ordered factor in R
Finally, you need to be careful about the influence of this categorical predictor. For example, one thing I know that you can do with the randomForest package is to set the randomForest parameter mtry to a lower number. This controls the number of variables that the algorithm looks through for each split. When it's set lower you'll have fewer instances of your categorical predictor appear vs. the rest of the variables. This will speed up estimation times, and allow the advantage of decorrelation from the randomForest method ensure you don't overfit your categorical variable.
Finally, I'd recommend looking at the MARS or PRIM methods. My professor has some slides on that here. I know that PRIM is known for being low in computational requirement.