Random Forest vs Logistic Regression - r

I am working on a classification problem. One column of the dataset has around 11,000 missing values out of 300k observations. (It is a categorical variable, so the kind of imputation used for numerical variables is not possible.)
Is it advisable to go ahead with Random Forest rather than Logistic Regression, given that Random Forest is supposedly not affected by missing values?
Also, do I need to take care of multi-collinearity among the independent variables when using RF, or is that unnecessary?

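Although RF is fairly robust to noisy data, it is hard to say that it is better than logistic regression here, because logistic regression can also be improved through pre-processing (PCA or missing-data imputation) or ensemble methods. Note also that not every RF implementation accepts missing values: in R, randomForest() stops on NAs by default, so they have to be imputed first (e.g. with na.roughfix or rfImpute) or, for a categorical variable, recoded as a level of their own. The same recoding works for logistic regression.
I think RF does not need multi-collinearity to be handled for prediction. Each tree is grown on a bootstrap sample and only a random subset of the variables is considered at each split, so correlated predictors simply compete for the same splits, and the predictions are not destabilised the way regression coefficients are. Variable importance, however, tends to be shared among correlated predictors, so it should be interpreted with care.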
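One option for a categorical column with missing values is to treat "missing" as a category of its own. The following is a minimal sketch of that idea in base R, on simulated data; the names x_cat, x_num and y are hypothetical and not taken from the question.

```r
set.seed(1)
n <- 1000

# Simulated stand-in data (hypothetical names, not the asker's dataset)
x_cat <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
x_cat[sample(n, 40)] <- NA                      # knock out some categories
x_num <- rnorm(n)
is_a  <- !is.na(x_cat) & x_cat == "a"
y     <- rbinom(n, 1, plogis(0.5 * x_num + is_a))

# Recode NA as an explicit factor level so that no rows are dropped
x_cat_filled <- addNA(x_cat)
levels(x_cat_filled)[is.na(levels(x_cat_filled))] <- "Missing"

# Logistic regression now uses all n rows
fit <- glm(y ~ x_num + x_cat_filled, family = binomial)
table(x_cat_filled)
nobs(fit)
```

The same recoded factor could be passed to a random forest. Whether "Missing" is a sensible category depends on why the values are missing, so this is a starting point rather than a recommendation.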

Related

Imputation missing data for MLM in R

Maybe someone can help me with this question. I conducted a follow-up study and now have to deal with missing data. I am considering how best to impute the missing data for a multilevel model (MLM) in R (e.g. participants completed the follow-up 2 survey but not the follow-up 1 survey, so I am missing L1 predictors for my longitudinal analysis).
I read about Multiple Imputation of multilevel data using the pan package (Schafer & Yucel, 2002) and came across the following code:
imp <- panImpute(data, formula = fml, n.burn = 1000, n.iter = 100, m = 5)
Yet I have trouble understanding it completely. Is there maybe another way to impute missing data in R? Or could somebody illustrate the imputation process in a bit more detail? That would be great! Do I have to run the imputation for every model I build in my MLM? (E.g. when I compare whether a random-intercept model or a random-intercept-and-random-slope model fits my data better, do I have to run the imputation code for every model, or do I run it once at the beginning of all my calculations?)
Thank you in advance
Is there maybe another way to impute missing data in R?
There are other packages. mice is the one that I normally use, and it does support multilevel data.
Do I have to run the imputation for every model I build in my MLM? (E.g. when I compare whether a random-intercept model or a random-intercept-and-random-slope model fits my data better, do I have to run the imputation code for every model, or do I run it once at the beginning of all my calculations?)
You have to specify the imputation model. Basically, that means telling the software which variables are predicted by which other variables. Since you are comparing models with the same fixed effects and only changing the random effects (in particular, comparing models with and without random slopes), the imputation model should be the same in both cases. So the workflow is:
1. perform the imputations;
2. run the model on each of the imputed datasets;
3. pool the results (typically using Rubin's rules).
You will need to do steps 2 and 3 twice, to end up with two sets of pooled results, one for each model. The software should provide functionality for doing all of this.
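The three steps above can be sketched in base R to show what the software does under the hood. This is a deliberately simplified illustration on simulated data: the imputation step is a plain stochastic regression imputation (not a multilevel one, and without drawing the imputation-model parameters), and all variable names are hypothetical. Only the pooling step follows Rubin's rules as usually stated.

```r
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

dat <- data.frame(y = y, x = x)
dat$x[sample(n, 40)] <- NA          # 20% of the predictor is missing
miss <- is.na(dat$x)

m <- 5                              # number of imputed datasets
imp_model <- lm(x ~ y, data = dat)  # imputation model, fitted on complete cases
sigma_hat <- summary(imp_model)$sigma

est <- numeric(m)                   # slope estimate per imputed dataset
se2 <- numeric(m)                   # its squared standard error

for (i in seq_len(m)) {
  # Step 1: impute (prediction plus random noise)
  completed <- dat
  completed$x[miss] <- predict(imp_model, newdata = dat[miss, , drop = FALSE]) +
    rnorm(sum(miss), sd = sigma_hat)
  # Step 2: run the analysis model on the completed data
  fit <- lm(y ~ x, data = completed)
  est[i] <- coef(fit)["x"]
  se2[i] <- vcov(fit)["x", "x"]
}

# Step 3: pool with Rubin's rules
q_bar <- mean(est)                   # pooled estimate
u_bar <- mean(se2)                   # within-imputation variance
b     <- var(est)                    # between-imputation variance
t_var <- u_bar + (1 + 1 / m) * b     # total variance
c(estimate = q_bar, se = sqrt(t_var))
```

In practice a package such as mice handles steps 2 and 3 through its with() and pool() functions, and a multilevel imputation method would replace the simple regression used here.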
Having said all of that, I would advise against choosing your model based on fit statistics and instead use expert knowledge. If you have strong theoretical reasons for expecting slopes to vary by group, then include random slopes. If not, then don't include them.

Extracting normal-distributed subset from a dataset in R

I am working with a dataset of ~200 observations and a number of variables. Unfortunately, none of the variables is normally distributed. Is it possible to extract a subset of the data in which at least one desired variable is normally distributed? I want to run some statistics afterwards (at least logistic regression).
Any help will be much appreciated,
Phil
If there are just a few observations that skew the distribution of individual variables, and no other reasons speaking against using a particular method (such as logistic regression) on your data, you might want to study the nature of "weird" observations before deciding on which analysis method to use eventually.
I would:
carry out the desired regression analysis and, as is always required, follow it with a residual analysis (normal Q-Q plot, Tukey-Anscombe plot, leverage plot; also see here) to check the model assumptions. For a linear model, check whether the residuals are normally distributed: normality of the residuals is the actual assumption in linear regression, not normality of each variable (you might well have, e.g., bimodally distributed data if there are differences between groups). Logistic regression does not assume normally distributed predictors or residuals at all. Then see whether any observations could be regarded as outliers, study them (see e.g. here) and, if justified, remove them from the final dataset before re-fitting the model.
However, you always have to state which observations were removed, and on what grounds. Maybe the outliers can be explained as errors in data collection?
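To illustrate the point that it is the residuals, not the raw variables, that matter in linear regression, here is a small base R sketch on simulated data (the variables x and y are hypothetical). The predictor is strongly skewed, yet the residuals of the fitted model can still be checked against normality in the usual way.

```r
set.seed(1)
n <- 200
x <- rexp(n)                   # clearly non-normal (right-skewed) predictor
y <- 2 + 3 * x + rnorm(n)      # errors are normal by construction

fit <- lm(y ~ x)
res <- residuals(fit)

# Normality test on the raw predictor versus on the residuals
p_x   <- shapiro.test(x)$p.value
p_res <- shapiro.test(res)$p.value
c(predictor = p_x, residuals = p_res)

# Normal Q-Q plot of the residuals
qqnorm(res)
qqline(res)

# Tukey-Anscombe plot: residuals against fitted values
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

The predictor fails the normality test, but that alone says nothing about whether the model is adequate; the diagnostic plots of the residuals are what to look at.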
The issue of whether it's a good idea to remove outliers, or a better idea to use robust methods was discussed here.
As suggested by GuedesBF, you may also want to find a test or modelling method that does not assume normality.
Before modelling anything or removing any data, I would always plot the data by treatment / outcome groups, and inspect the presence of missing values. After quickly looking at your dataset, it seems that quite some variables have high levels of missingness, and your variable 15 has a lot of zeros. This can be quite problematic for e.g. linear regression.
Understanding and describing your data in a model-free way (with clever plots, e.g. using ggplot2 and multiple aesthetics) is much better than fitting a model and interpreting p-values when violating model assumptions.
A good way to get an overview of all the data, their distributions and pairwise correlations (provided you don't have more than around 20 variables) is to use pairs.panels from the psych package.
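# Read the tab-delimited file; it has no header row
dat <- read.delim("~/Downloads/dput.txt", header = FALSE)
library(psych)
# Scatterplots, histograms and correlations, in two batches of columns
pairs.panels(dat[, 1:12])
pairs.panels(dat[, 13:23])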
You can then quickly see the distribution of each variable, and the presence of correlations among each pair of variables. You can tune arguments of that function to use different correlation methods, and different displays. Happy exploratory data analysis :)

Procedure to identify the most significant predictors variables using R when data has tremendous multicollinearity?

I have a database of around 36 predictor variables which I am using to predict a target variable. The target is a categorical variable consisting of three different classes whereas predictor variables include both numeric and categorical variables.
However, the data are subject to severe multi-collinearity. I am trying to build a parsimonious logistic regression model, so I need to reduce the number of variables. When I drop variables according to their VIF values, the results become counter-intuitive as soon as the number of variables is reduced. On the other hand, I am not sure that PCR (principal component regression) can solve the problem, as I need to draw inferences about the sensitivity to each variable.
What is the best option for dealing with such a problem?
Which R packages can I use?
Will factor analysis solve the problem?
Or can we infer everything from PCR?
First run an ANOVA or Kruskal-Wallis test for each predictor to check which variables are well suited to your problem. For 36 variables I don't think you will need PCA, as it would make your model lose some explainability.
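A screening step of this kind could look like the sketch below, which runs a Kruskal-Wallis test of each numeric predictor against a three-class target. The data are simulated and the names (target, informative, noise) are hypothetical; categorical predictors would need a different test, such as a chi-squared test.

```r
set.seed(10)
n <- 150

# Hypothetical three-class target and two numeric predictors
target <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
preds <- data.frame(
  informative = rnorm(n, mean = as.integer(target)),  # mean shifts with the class
  noise       = rnorm(n)                              # unrelated to the class
)

# One Kruskal-Wallis test per predictor; small p-values suggest the
# predictor's distribution differs between the classes
p_values <- sapply(preds, function(x) kruskal.test(x, target)$p.value)
sort(p_values)
```

Such univariate tests ignore the correlation between predictors, so they only give a first impression and do not resolve the multi-collinearity itself.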
Remember that PCA will reduce dimensionality and also explain only part of the data variance. Factor Analysis will generate groups of variables in factors, in case you want to run a segmented logistic regression for each factor of grouped variables.
If you want to build a parsimonious logistic regression, you can apply regularization to improve how well the model generalizes, instead of reducing the number of variables.
You can use the following R packages: caret (logistic regression), ROCR (AUC), ggplot2 (plots), DMwR (outliers), mice (missing values).
Also, if you want to make a regularization, you can use the following formula:
In this case, you can develop regularization from scratch, without a library, to adjust the inclination of the sigmoid, so that you can correctly classify your classes:
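A minimal sketch of regularization written from scratch in base R, assuming an L2 (ridge) penalty and, for simplicity, a binary outcome rather than the three classes in the question. This is one standard form, not necessarily the formula the answer had in mind; the data, the function names penalised_nll and fit_ridge, and the value of lambda are all illustrative assumptions.

```r
set.seed(123)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)          # nearly collinear with x1
y  <- rbinom(n, 1, plogis(1.5 * x1))

# Design matrix: intercept plus standardised predictors
X <- cbind(1, scale(cbind(x1, x2)))

# Negative log-likelihood of logistic regression plus an L2 penalty.
# The intercept (first coefficient) is left unpenalised.
penalised_nll <- function(beta, X, y, lambda) {
  eta <- drop(X %*% beta)
  nll <- -sum(y * eta - log1p(exp(eta)))
  nll + lambda * sum(beta[-1]^2)
}

# Minimise the penalised criterion numerically for a given lambda
fit_ridge <- function(lambda) {
  optim(rep(0, ncol(X)), penalised_nll,
        X = X, y = y, lambda = lambda, method = "BFGS")$par
}

beta_unpen <- fit_ridge(0)   # ordinary logistic regression
beta_ridge <- fit_ridge(5)   # shrunken coefficients

rbind(unpenalised = beta_unpen, ridge = beta_ridge)
```

With nearly collinear predictors the unpenalised slopes tend to be unstable, while the penalty pulls them towards smaller, more similar values. In real use lambda would be chosen by cross-validation.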

R: Which variables to include in model?

I'm fairly new to R and am currently trying to find the best model to predict my dependent variable from a number of predictor variables. I have 20 predictor variables and I want to see which ones I should include in my model and which ones I should exclude.
I am currently just running models with different predictor variables in each and comparing them to see which one has the lowest AIC, but this is taking a really long time. Is there an easier way to do this?
Thank you in advance.
This is more of a theoretical question actually...
In principle, if all of the predictors are actually exogenous to the model, they can all be included together and assuming you have enough data (N >> 20) and they are not too similar (which could give rise to multi-collinearity), that should help prediction. In practice, you need to think about whether each of (or any of) your predictors are actually exogenous to the model (that is, independent of the error term in the model). If they are not, then they will impart a bias on the estimates. (Also, omitting explanatory variables that are actually necessary imparts a bias.)
If predictive accuracy (even spurious in-sample accuracy) is the goal, then techniques like LASSO (as mentioned in the comments) could also help.
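The manual AIC comparison described in the question can be automated with step() from base R, which adds or drops terms one at a time and keeps the change that lowers the AIC. The sketch below uses simulated data with hypothetical names (x1 to x6, y); it is an illustration of the mechanics, and the caveats above about exogeneity and spurious in-sample fit still apply to whatever model it returns.

```r
set.seed(7)
n <- 200

# Six hypothetical predictors, of which only x1 and x2 affect y
dat <- as.data.frame(matrix(rnorm(n * 6), nrow = n))
names(dat) <- paste0("x", 1:6)
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

# Start from the full model and let step() drop terms by AIC
full <- lm(y ~ ., data = dat)
sel  <- step(full, direction = "backward", trace = 0)

formula(sel)
c(full = AIC(full), selected = AIC(sel))
```

Setting trace = 1 prints each step, which helps when checking why a variable was dropped.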

R code: Extracting highly correlated variables and Running multivariate regression model with selected variables

I have a huge dataset with about 2,000 variables and about 10,000 observations.
Initially, I wanted to run a regression model for each variable, using the other 1,999 as independent variables, and then do stepwise model selection.
That would give me 2,000 models.
Unfortunately, R threw errors because of a lack of memory.
So, as an alternative, I tried to remove the independent variables that have a low correlation with the dependent variable (say, lower than 0.5).
I would then like to run a regression model using only the variables that are highly correlated with each dependent variable.
I tried the following code, but even the melt function doesn't work because of the memory issue.. oh god..
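library(reshape2)  # melt() comes from reshape2

test <- data.frame(X1 = rnorm(50, mean = 50, sd = 10),
                   X2 = rnorm(50, mean = 5, sd = 1.5),
                   X3 = rnorm(50, mean = 200, sd = 25))
test$X1[10] <- 5
test$X2[10] <- 5
test$X3[10] <- 530

corr <- cor(test)
# Blank out the diagonal and upper triangle so each pair appears once
diag(corr) <- NA
corr[upper.tri(corr)] <- NA
melt(corr)
# It doesn't work with my own data because of lack of memory.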
Please help me.. and thank you so much in advance..!
In such a situation it might be worth trying sparsity-inducing techniques such as the Lasso, which selects a sparse subset of variables by constraining the sum of the absolute values of the regression coefficients.
This will give you a reduced subset of the variables that are most relevant. Because variables enter the Lasso path in order of their correlation with the current residual, the selected variables also tend to be strongly correlated with the response, which is what you were looking for.
In R you can use the lars package; information about the Lasso can be found here:
http://www-stat.stanford.edu/~tibs/lasso.html
Also a very good resource is: http://www-stat.stanford.edu/~tibs/ElemStatLearn/
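For the correlation-screening step attempted in the question, the highly correlated pairs can also be pulled straight out of the correlation matrix with which(), without reshaping the whole matrix through melt(). This is a small base R sketch on simulated data; the variable names and the 0.5 cutoff follow the question, and the object name high_pairs is hypothetical.

```r
set.seed(1)
n <- 100

# Simulated data: X2 and X4 are built from X1, X3 is unrelated
dat <- data.frame(X1 = rnorm(n))
dat$X2 <- dat$X1 + rnorm(n, sd = 0.3)
dat$X3 <- rnorm(n)
dat$X4 <- -dat$X1 + rnorm(n, sd = 0.5)

corr <- cor(dat)
# Keep only the lower triangle so that each pair is reported once
corr[upper.tri(corr, diag = TRUE)] <- NA

# Row/column positions of the entries above the cutoff (NAs are skipped)
idx <- which(abs(corr) > 0.5, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx]
)
high_pairs
```

Only the pairs that pass the cutoff are materialised, so the result stays small even when the matrix itself is large. The 2,000 by 2,000 correlation matrix still has to fit in memory.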
