R package for Weighted Random Forest? classwt option? - r

I'm trying to use Random Forest to predict the outcome of an extremely imbalanced data set (the rate of 1's is only about 1%, or even less). Because the traditional randomForest minimizes the overall error rate rather than paying special attention to the positive class, it is not well suited to imbalanced data. So I want to assign a high cost to misclassification of the minority class (cost-sensitive learning).
I read in several sources that we can use the classwt option of randomForest in R, but I don't know how to use it. Are there any alternatives to the randomForest function?

classwt lets you assign a prior probability to each of the classes in your dataset. So, if you have classwt = c(0.5, 0.5), you are saying that, before actually running the model on your specific dataset, you expect there to be around the same number of 0's as 1's. You can adjust these values as you like to try to minimize false negatives. Assigning a cost this way may seem like a smart idea in theory, but in practice it does not work so well: the prior probabilities tend to affect the algorithm more sharply than desired. Still, you could play around with this.
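As a minimal sketch of how classwt is passed, assuming the randomForest package is installed: the simulated data, the object names and the weights c(1, 50) are illustrative assumptions, not recommended values. The weights are given in the order of levels(y) and need not sum to one.

```r
# Sketch only: simulated data with a rare positive class (2% of rows).
set.seed(1)
n <- 2000
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
score <- x$x1 + rnorm(n, sd = 0.5)
y <- factor(as.integer(score > quantile(score, 0.98)), levels = c(0, 1))

if (requireNamespace("randomForest", quietly = TRUE)) {
  # classwt follows the order of levels(y): first "0", then "1".
  # The value 50 is an arbitrary illustration; tune it on your own data.
  rf_w <- randomForest::randomForest(x = x, y = y, ntree = 200,
                                     classwt = c(1, 50))
  # Out-of-bag confusion matrix with per-class error rates
  print(rf_w$confusion)
}
```

Comparing the out-of-bag confusion matrix for a few different weights is one way to see how sharply the priors shift the errors between the two classes.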
An alternative solution is to run the regular random forest, and then for a prediction, use the type='prob' option in the predict() command. For instance, for a random forest rf1, where we are trying to predict the results of a dataset data1, we could do:
predictions <- predict(rf1, newdata=data1, type='prob')
Then, you can choose your own probability threshold for classifying the observations in your data. A nice way to see graphically which threshold may be desirable is to use the ROCR package, which plots receiver operating characteristic (ROC) curves.
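The thresholding step can be sketched in base R as follows. Here prob and truth are simulated stand-ins (assumptions) for the positive-class column of the predicted probabilities and the observed labels, and Youden's J is just one possible selection criterion, not something the approach above prescribes.

```r
# Sketch only: simulated probabilities in place of real random forest output.
set.seed(42)
truth <- rep(c(1, 0), times = c(50, 4950))        # 1% positives
prob  <- ifelse(truth == 1,
                rbeta(length(truth), 2, 5),       # positives score higher
                rbeta(length(truth), 1, 12))

thresholds <- seq(0.05, 0.50, by = 0.05)
perf <- as.data.frame(t(sapply(thresholds, function(th) {
  pred <- as.integer(prob >= th)
  c(threshold   = th,
    sensitivity = sum(pred == 1 & truth == 1) / sum(truth == 1),
    specificity = sum(pred == 0 & truth == 0) / sum(truth == 0))
})))
print(perf)

# One possible rule (an assumption): maximise sensitivity + specificity - 1
best <- perf$threshold[which.max(perf$sensitivity + perf$specificity - 1)]
best
```

With a real model, prob would be something like predictions[, "1"], and the criterion should reflect the relative cost of false negatives and false positives.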

Related

Imputation missing data for MLM in R

Maybe someone can help me with this question. I conducted a follow-up study and now, unsurprisingly, have to deal with missing data. I am considering how best to impute the missing data for MLM in R (e.g. some participants completed the follow-up 2 survey but not the follow-up 1 survey, so I am missing L1 predictors for my longitudinal analysis).
I read about Multiple Imputation of multilevel data using the pan package (Schafer & Yucel, 2002) and came across the following code:
imp <- panImpute(data, formula = fml, n.burn = 1000, n.iter = 100, m = 5)
Yet I have trouble understanding it completely. Is there maybe another way to impute missing data in R? Or could somebody illustrate the imputation method in a bit more detail? That would be great! Do I have to conduct the imputation for every model I build in my MLM? (E.g. when I compare whether a random-intercept model or a random-intercept-and-random-slope model fits my data better, do I have to run the imputation code for every model, or do I run it once at the beginning of all my calculations?)
Thank you in advance
Is there maybe another way to impute missing data in R?
There are other packages. mice is the one that I normally use, and it does support multilevel data.
Do I have to conduct the imputation for every model I build in my MLM? (E.g. when I compare whether a random-intercept model or a random-intercept-and-random-slope model fits my data better, do I have to run the imputation code for every model, or do I run it once at the beginning of all my calculations?)
You have to specify the imputation model. Basically, that means you have to tell the software which variables are predicted by which other variables. Since you are comparing models with the same fixed effects and only changing the random effects (in particular, comparing models with and without random slopes), the imputation model should be the same in both cases. So the workflow is:
perform the imputations;
run the model on each of the imputed datasets;
pool the results (typically using Rubin's rules).
So you will need to do this twice, to end up with 2 sets of pooled results - one for each model. The software should provide functionality for doing all of this.
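To make the pooling step concrete, here is a small base-R sketch of Rubin's rules for a single coefficient. The estimates and standard errors are made-up illustrative numbers, not results from any real analysis; in practice the imputation software's own pooling function would do this for every coefficient.

```r
# Made-up estimates of one coefficient from m = 5 imputed datasets
est <- c(0.50, 0.55, 0.45, 0.52, 0.48)
se  <- c(0.10, 0.11, 0.10, 0.12, 0.10)
m   <- length(est)

q_bar <- mean(est)                    # pooled point estimate
u_bar <- mean(se^2)                   # within-imputation variance
b     <- var(est)                     # between-imputation variance
t_var <- u_bar + (1 + 1 / m) * b      # total variance (Rubin's rules)

pooled <- c(estimate = q_bar, se = sqrt(t_var))
print(pooled)
```

The pooled standard error is larger than the average per-dataset standard error because the between-imputation term carries the extra uncertainty due to the missing data.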
Having said all of that, I would advise against choosing your model based on fit statistics and instead use expert knowledge. If you have strong theoretical reasons for expecting slopes to vary by group, then include random slopes. If not, then don't include them.

Extracting normal-distributed subset from a dataset in R

I'm working with a dataset of ~200 observations and a number of variables. Unfortunately, none of the variables is normally distributed. Is it possible to extract a subset of the data in which at least one desired variable is normally distributed? I want to do some statistics afterwards (at least logistic regression).
Any help will be much appreciated,
Phil
If there are just a few observations that skew the distribution of individual variables, and no other reasons speaking against using a particular method (such as logistic regression) on your data, you might want to study the nature of "weird" observations before deciding on which analysis method to use eventually.
I would:
carry out the desired regression analysis (e.g. logistic regression) and, as is always required, carry out residual analysis (Q-Q Normal plot, Tukey-Anscombe plot, Leverage plot, also see here) to check the model assumptions. See whether the residuals are normally distributed: the actual assumption in linear regression is that the model residuals are normally distributed, not that each variable is. Of course, you might have e.g. bimodally distributed data if there are differences between groups. Then see if there are observations that could be regarded as outliers, study them (see e.g. here), and if possible remove them from the final dataset before re-fitting the linear model without outliers.
However, you always have to state which observations were removed, and on what grounds. Maybe the outliers can be explained as errors in data collection?
The issue of whether it's a good idea to remove outliers, or a better idea to use robust methods was discussed here.
As suggested by GuedesBF, you may want to find a test or modelling method that does not assume normality.
Before modelling anything or removing any data, I would always plot the data by treatment / outcome groups and check for missing values. After quickly looking at your dataset, it seems that quite a few variables have high levels of missingness, and your variable 15 has a lot of zeros. This can be quite problematic for e.g. linear regression.
Understanding and describing your data in a model-free way (with clever plots, e.g. using ggplot2 and multiple aesthetics) is much better than fitting a model and interpreting p-values when violating model assumptions.
A good way to get an overview of all the data, their distributions and pairwise correlations (provided you don't have more than around 20 variables) is to use pairs.panels from the psych library.
dat <- read.delim("~/Downloads/dput.txt", header = FALSE)
library(psych)
psych::pairs.panels(dat[,1:12])
psych::pairs.panels(dat[,13:23])
You can then quickly see the distribution of each variable, and the presence of correlations among each pair of variables. You can tune arguments of that function to use different correlation methods, and different displays. Happy exploratory data analysis :)

How to run a truncated and inflated Poisson model in R?

My data doesn't contain any zeros. The minimum value for my outcome, y, is 1 and that is the value that is inflated. My objective is to run a truncated and inflated Poisson regression model using R.
I already know how to run each regression separately, zero-truncated and zero-inflated. I want to know how to combine the two conditions into one model.
Thanks for your help.
For zero-inflated or zero-hurdle models, the standard approach is to use the pscl package. I also wrote a package that fits this kind of model here, but it is not yet mature or fully tested. Unless you have voluminous data, I still recommend pscl, which is more flexible, robust and better documented.
For zero-truncated models, you can have a look at the VGAM::vglm function. You might find useful information here.
Note that the two approaches do not make the same distributional assumption, so you won't need the same estimation data. Given the description of your dataset, I think you are looking for a zero-truncated model (since you do not observe zeros). With zero-inflated models, you decompose the observed pattern into zeros generated by a selection model and the remaining counts generated by a count data model. This does not look like a pattern consistent with your dataset.
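To show what the zero-truncated assumption means in practice, here is a base-R sketch that writes out the zero-truncated Poisson log-likelihood and maximises it with optim(). This is a hand-rolled illustration on simulated data, not the VGAM implementation, and the coefficient values are assumptions chosen for the simulation. It does not address the extra inflation at 1 mentioned in the question.

```r
# Sketch only: simulate Poisson counts, then drop the zeros (truncation).
set.seed(123)
n <- 20000
x_all <- runif(n)
beta_true <- c(0.5, 0.8)                       # assumed, for simulation only
y_all <- rpois(n, exp(beta_true[1] + beta_true[2] * x_all))
keep <- y_all > 0
y <- y_all[keep]
x <- x_all[keep]

# Zero-truncated Poisson: P(Y = y) = dpois(y, lambda) / (1 - exp(-lambda)), y >= 1
negloglik <- function(beta) {
  lambda <- exp(beta[1] + beta[2] * x)
  -mean(y * log(lambda) - lambda - lgamma(y + 1) - log1p(-exp(-lambda)))
}

fit <- optim(c(0, 0), negloglik, method = "BFGS")
fit$par   # estimates of the intercept and slope

# With VGAM installed, a comparable fit would be roughly:
# VGAM::vglm(y ~ x, family = VGAM::pospoisson())
```

The only difference from an ordinary Poisson likelihood is the log1p(-exp(-lambda)) term, which renormalises the distribution over the positive counts.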

poLCA not stable estimates

I am trying to run a latent class analysis with covariates using the poLCA package. However, every time I run the model, the multinomial logit coefficients come out different. I have accounted for changes in the order of the classes, and I set a very high number of replications (nrep=1500). Still, when I rerun the model I obtain different results. For example, I have 3 classes (high, low, medium). No matter the order in which the classes are considered in the estimation, the multinomial model gives me different coefficients for the same comparisons across estimations (such as low vs high and medium vs high). Should I further increase the number of repetitions to get stable results? Any idea why this is happening? I know that with set.seed() I can replicate the results, but I would like to obtain stable estimates so that I can claim the results are valid. Thank you very much!
From the manual (?poLCA):
As long as probs.start=NULL, each function call will use different
(random) initial starting parameters
You need to use set.seed() or set probs.start in order to get consistent results across function calls.
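A minimal sketch of both options, assuming the poLCA package is installed and using its bundled carcinoma example data rather than the data from the question (the seed value and object names are arbitrary):

```r
if (requireNamespace("poLCA", quietly = TRUE)) {
  data(carcinoma, package = "poLCA")
  f <- cbind(A, B, C, D, E, F, G) ~ 1

  # Option 1: fix the random seed before each call
  set.seed(2024)
  fit1 <- poLCA::poLCA(f, carcinoma, nclass = 2, verbose = FALSE)
  set.seed(2024)
  fit2 <- poLCA::poLCA(f, carcinoma, nclass = 2, verbose = FALSE)

  # Option 2: reuse the starting values stored in an earlier fit
  fit3 <- poLCA::poLCA(f, carcinoma, nclass = 2,
                       probs.start = fit1$probs.start, verbose = FALSE)

  # Maximised log-likelihoods of the three fits
  print(c(fit1$llik, fit2$llik, fit3$llik))
}
```

Comparing the maximised log-likelihoods across runs is a quick check on whether different starts are reaching the same solution.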
Actually, if you are not converging with different starting points, you have a data problem.
LCA uses a kind of maximum likelihood estimation. If there is no convergence, you have an under-identification problem: you have too little information to estimate the number of classes that you have specified. A model with fewer classes might run, or you will have to impose some a priori restrictions.
You might wish to read Latent Class and Latent Transition Analysis by Collins. It was a great help for me.

Mixed Logit fitted probabilities in RSGHB

My question has to do with using the RSGHB package for predicting choice probabilities per alternative by applying mixed logit models (variation across respondents) with correlated coefficients.
I understand that the choice probabilities are simulated at the individual level and that, to get preference shares, an average of the individual shares would do. All the sources I have found treat each prediction as a separate simulation, which makes the whole process cumbersome if many predictions are needed.
Since one can save the respondent-specific coefficient draws, wouldn't it be faster to simply apply the logit transform to each (vector of) coefficient draw(s)? Once this is done, shares for new or existing alternatives could be calculated faster than by rerunning a whole simulation for each required alternative. For the time being, using a fitted() approach will not help me understand how prediction actually works.
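The idea described in the question could be sketched in base R roughly as follows. The draws matrix, the attribute matrix X and the function predict_shares are hypothetical stand-ins: the draws are simulated here, and nothing in this sketch is RSGHB's API or output format.

```r
# Sketch only: simulated respondent-level coefficients (one row per respondent).
set.seed(7)
n_resp <- 200
draws <- matrix(rnorm(n_resp * 3,
                      mean = rep(c(1, -0.5, 0.3), each = n_resp), sd = 0.5),
                nrow = n_resp)

# Attribute levels of the alternatives in one scenario (rows = alternatives)
X <- rbind(alt_A = c(1, 2, 0),
           alt_B = c(0, 1, 1),
           alt_C = c(1, 0, 1))

predict_shares <- function(draws, X) {
  v <- draws %*% t(X)               # utilities: respondents x alternatives
  v <- v - apply(v, 1, max)         # subtract row maxima for numerical stability
  p <- exp(v) / rowSums(exp(v))     # logit probabilities per respondent
  list(individual = p, shares = colMeans(p))
}

res <- predict_shares(draws, X)
res$shares
```

A new scenario only needs a new X, so no re-estimation is involved. If several draws per respondent are saved, the same function can be applied to each draw and the probabilities averaged over draws before averaging over respondents.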
