Handling Sparse Data Frames - algorithm selection - r

I am new to machine learning/statistical modelling.
I am trying to run a classification on a highly sparse dataset with 100 features, most of which are categorical (TRUE/FALSE) with the remaining values missing. To handle missing values, I filled the missing spots with the text 'Nothing', thereby creating a new level.
Next, I am trying to run a penalized logistic regression (glmnet package). When I check the coefficients, I see that the dummy variables corresponding to 'Nothing' have the largest coefficients.
Should I remove these coefficients? Is there a better approach to this?
Or should I just use trees? Please suggest the best way forward.
Thanks!
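One way to sanity-check the 'Nothing'-level approach is a minimal sketch with synthetic data, using scikit-learn as a stand-in for glmnet (all names and data here are illustrative, not from the question): encode the NAs as an explicit level, one-hot encode, and let an L1 penalty shrink uninformative dummies, including the 'Nothing' ones, toward zero rather than removing them by hand.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical sparse TRUE/FALSE data with ~80% missing values
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.choice([True, False, None], size=(500, 5), p=[0.1, 0.1, 0.8]),
                 columns=[f"f{i}" for i in range(5)])
y = rng.integers(0, 2, size=500)

# Fill NAs with an explicit 'Nothing' level, then one-hot encode
X_enc = pd.get_dummies(X.fillna("Nothing").astype(str))

# The L1 penalty (the analogue of glmnet's lasso penalty) shrinks
# uninformative dummies -- including 'Nothing' ones -- to exactly zero
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X_enc, y)
coefs = pd.Series(clf.coef_[0], index=X_enc.columns)
print(coefs.sort_values(key=abs, ascending=False).head())
```

With a strong enough penalty, 'Nothing' dummies that carry no signal drop out on their own; if they survive with large coefficients, the missingness itself may be informative.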

Related

Random Forest vs Logistic Regression

I am working on a classification problem. One column of the dataset has around 11,000 missing values out of 300k total observations (it is a categorical variable, so numeric-style imputation such as the mean is not possible).
Is it advisable to go ahead with Random Forest rather than Logistic Regression, since Random Forest is less affected by missing values?
Also, do I need to take care of multicollinearity among the independent variables when using RF, or is there no need for that?
Although RF can handle noisy data and missing values, it is hard to say that it is better than logistic regression, because logistic regression can also be improved through preprocessing (PCA or missing-data imputation) or ensemble methods.
I think RF does not have to account for multicollinearity. Variables are randomly selected to build different trees, and in this process the most important attributes get chosen, which is usually interpreted as mitigating the multicollinearity problem.
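Both claims can be checked empirically. A hedged sketch with synthetic data (scikit-learn here as a stand-in for the R packages; the data and feature names are made up): encode missing values as their own level, add a nearly collinear feature, and compare cross-validated scores of RF and logistic regression.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # nearly collinear with x1
cat = rng.choice(["a", "b", None], size=n)  # categorical with missing values
y = ((x1 + (pd.Series(cat) == "a") + rng.normal(size=n)) > 0.5).astype(int)

# Encode missingness as its own level, then one-hot encode the categorical
X = pd.get_dummies(pd.DataFrame({"x1": x1, "x2": x2,
                                 "cat": pd.Series(cat).fillna("missing")}))

rf = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=3)
lr = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3)
print(f"RF accuracy: {rf.mean():.3f}  LR accuracy: {lr.mean():.3f}")
```

The collinear pair (x1, x2) hurts coefficient interpretation in logistic regression, but prediction quality for both models is typically similar, which matches the answer above: neither model wins by default.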

R: training random forest using PCA data

I have a data set called Data, with 30 scaled and centered features and 1 outcome with column name OUTCOME, covering 700k records, stored in data.table format. I computed its PCA and observed that the first 8 components account for 95% of the variance. I want to train a random forest in h2o, so this is what I do:
Data.pca <- prcomp(Data, retx = TRUE)                # compute the PCA of Data
Data.rotated <- as.data.table(Data.pca$x)[, c(1:8)]  # keep only the first 8 components
Data.dump <- cbind(Data.rotated, subset(Data, select = c(OUTCOME)))  # PCA dataset plus outcomes for training
This way I have a dataset Data.dump with 8 features rotated onto the PCA components, and each record is paired with its outcome.
First question: is this reasonable? Do I have to permute the outcomes vector somehow, or are the two things unrelated?
Then I split Data.dump into two sets, Data.train for training and Data.test for testing, both converted with as.h2o. Then I feed them to a random forest:
rf=h2o.randomForest(training_frame=Data.train,x=1:8,y=9,stopping_rounds=2,
ntrees=200,score_each_iteration=T,seed=1000000)
rf.pred=as.data.table(h2o.predict(rf,Data.test))
What happens is that rf.pred does not look much like the original outcomes Data.test$OUTCOME. I also tried to train a neural network, which did not even converge and crashed R.
Second question: is it because I am carrying some mistake over from the PCA treatment? Or because I set up the random forest badly? Or am I just dealing with difficult data?
I do not know where to start; I am new to data science, but the workflow seems correct to me.
Thanks a lot in advance.
The answer to your second question (i.e. "is it the data, or did I do something wrong") is hard to know. This is why you should always try to make a baseline model first, so you have an idea of how learnable the data is.
The baseline could be h2o.glm(), and/or it could be h2o.randomForest(), but either way without the PCA step. (You didn't say if you are doing a regression or a classification, i.e. if OUTCOME is a number or a factor, but both glm and random forest will work either way.)
Turning to your first question: yes, it is a reasonable thing to do, and no, you don't have to (in fact, should not) involve the outcomes vector.
Another way to answer your first question is: no, it is unreasonable. A random forest may be able to see all the relations itself without needing PCA. Remember that when you use PCA to reduce the number of input dimensions, you are also throwing away a bit of signal. You said the first 8 components capture only 95% of the variance, so you are trading away some signal in return for fewer inputs, optimizing for simplicity at the expense of prediction quality.
By the way, concatenating the original inputs with your 8 PCA components is another approach: you might get a better model by giving it this hint about the data. (But you might not, which is why getting some baseline models first is essential before trying these more exotic ideas.)
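The baseline-first advice can be sketched like this (scikit-learn as a stand-in for h2o, on synthetic data; everything here is illustrative): fit the same model with and without the PCA step and compare cross-validated scores before trusting the PCA pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 30))           # 30 scaled/centered features, as in the question
y = X[:, 0] * 2 + X[:, 1] + rng.normal(size=2000)

# Baseline: random forest on the raw features (no PCA)
base = cross_val_score(RandomForestRegressor(n_estimators=50, random_state=0), X, y, cv=3)

# PCA variant: keep enough components for ~95% of the variance
X_pca = PCA(n_components=0.95).fit_transform(X)
pca = cross_val_score(RandomForestRegressor(n_estimators=50, random_state=0), X_pca, y, cv=3)

print(f"baseline R^2: {base.mean():.3f}  with PCA: {pca.mean():.3f}")
```

If the baseline already scores poorly, the problem is the data (or the target), not the PCA step; if the baseline scores well and the PCA variant does not, the PCA step is discarding useful signal.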

Best way to handle sparse + non-sparse data to create a model

I'm wondering what is the best way to handle sparse + non-sparse data in e.g. a Ridge regression using scikit-learn.
Ridge can handle both sparse and non-sparse data.
Imagine something as simple as a description (text) field that gets Count/Tf-idf vectorized (sparse), plus a continuous income variable.
Now imagine that we have several other text fields and several other continuous variables.
What is the best way to model some continuous y variable?
I've considered making two separate models (one using sparse data, one using non-sparse) and somehow trying to combine.
I've also considered using PCA to make the sparse data into a "handleable" amount of continuous features.
How do you usually solve this issue?
Note: the continuous variables would have many unique values (and you'd lose power anyway when converting continuous to bins), and the text fields might end up having something like a million features, so they cannot be made dense.
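The usual trick is to keep everything sparse and stack the blocks side by side. A minimal sketch with made-up data (field names and values are illustrative): vectorize the text, append the dense column as a sparse matrix, and feed the combined matrix to Ridge, which accepts sparse input directly.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

docs = ["cheap used car", "luxury new car", "small apartment", "large house garden"]
income = np.array([[30000.0], [90000.0], [45000.0], [120000.0]])
y = np.array([5000.0, 40000.0, 900.0, 2500.0])

# Sparse text features stay sparse; the dense column is appended as sparse too
X_text = TfidfVectorizer().fit_transform(docs)
X = hstack([X_text, csr_matrix(income / income.max())])  # rough scaling, stays sparse

model = Ridge(alpha=1.0).fit(X, y)
print(model.predict(X))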
This reply may be a little out of context, but I want to understand what you mean by "Ridge can handle both sparse and non-sparse data". I am trying to run a logistic regression model in R where all predictors are text fields, and my dependent variable is very sparse: only 0.9% positive. Do you think Ridge would be a good algorithm to implement?

Multiple Imputation on New/Predictor Data

Can someone please help me understand how to handle missing values in new/unseen data? I have investigated a handful of multiple imputation packages in R, and all seem to impute only the training and test set (at the same time). How do you then impute new unlabeled data in the same way you did the train/test data? Basically, I want to use multiple imputation for missing values in the training/test set and apply the same model/method to new predictor data. Based on my research of multiple imputation (I'm not an expert), this does not seem feasible with MI. However, with caret, for example, you can easily apply the same preprocessing that was used in training/test to new data. Any help would be greatly appreciated. Thanks.
Edit:
Basically, my data set contains many missing values. Deletion is not an option, as it would discard most of my train/test set. Up to this point, I have encoded categorical variables and removed near-zero-variance and highly correlated variables. After this preprocessing, I was able to easily apply the mice package for imputation:
m=mice(sg.enc)
At this point, I could use the pool command to apply the model across the imputed data sets. That works fine. However, I know that future data will have missing values, and I would like to somehow apply this MI incrementally.
It does not do multiple imputation, but the yaImpute package has a predict() function to impute values for new data. I ran a test using training data (that included NAs) to create a "yai" object, then used that object via predict() to impute values in a new testing data set. Unlike caret's preProcess(), yaImpute supports factor variables (at least for imputing values for them) in its knn algorithm. I did not yet test whether factors can be part of the predictors for the missing target variables. yaImpute also supports other imputation methods besides knn.
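The fit-once, reuse-on-new-data pattern that yaImpute's predict() enables can be sketched with scikit-learn's KNNImputer as a stand-in (an assumption of this sketch: numeric features only, since KNNImputer, unlike yaImpute, does not handle factors):

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(3)
train = rng.normal(size=(200, 4))
train[rng.random(train.shape) < 0.1] = np.nan   # ~10% missing in training

# Fit the imputer once on the training data...
imputer = KNNImputer(n_neighbors=5).fit(train)

# ...then reuse the same fitted imputer on new/unseen data with the same columns
new = rng.normal(size=(5, 4))
new[0, 2] = np.nan
new_filled = imputer.transform(new)
print(np.isnan(new_filled).any())   # no NaNs remain
```

This is single imputation, not MI; to approximate MI's uncertainty handling you would need to repeat the imputation with perturbed inputs, which is exactly the difficulty the question describes.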

R code: Extracting highly correlated variables and Running multivariate regression model with selected variables

I have a huge dataset with about 2,000 variables and about 10,000 observations.
Initially, I wanted to run a regression model for each variable, each with 1,999 independent variables, and then do stepwise model selection.
Therefore, I would have 2,000 models.
Unfortunately, R ran out of memory and raised errors.
So, alternatively, I have tried to remove the independent variables with low correlation values (say, lower than .5).
I would then run a regression model using only the variables that are highly correlated with each dependent variable.
I tried the following code, but even the melt function doesn't work on my own data because of the memory issue:
library(reshape2)  # for melt()

test <- data.frame(X1 = rnorm(50, mean = 50, sd = 10),
                   X2 = rnorm(50, mean = 5, sd = 1.5),
                   X3 = rnorm(50, mean = 200, sd = 25))
test$X1[10] <- 5
test$X2[10] <- 5
test$X3[10] <- 530
corr <- cor(test)
diag(corr) <- NA
corr[upper.tri(corr)] <- NA
melt(corr)
# it doesn't work with my own data because of lack of memory
Please help me, and thank you so much in advance!
In such a situation it might be worth trying sparsity-inducing techniques such as the Lasso, where a sparse subset of variables is selected by constraining the sum of absolute values of the regression coefficients.
This will give you a reduced subset of the most relevant variables (and, due to the nature of the Lasso algorithm, also the ones most correlated with the response, which is what you were looking for).
In R you can use the lars package; information about the Lasso can be found here:
http://www-stat.stanford.edu/~tibs/lasso.html
Also a very good resource is: http://www-stat.stanford.edu/~tibs/ElemStatLearn/
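A minimal sketch of Lasso-based variable selection (scikit-learn here rather than the lars package mentioned above; the synthetic data mimics the many-variables, few-relevant setting of the question):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
n, p = 200, 500                      # far fewer observations than variables
X = rng.normal(size=(n, p))
y = X[:, 0] * 3 - X[:, 1] * 2 + rng.normal(size=n)  # only 2 truly relevant variables

# Cross-validated Lasso picks the penalty strength automatically
lasso = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # variables with nonzero coefficients
print(len(selected), "variables selected out of", p)
```

Unlike the correlation-filtering approach, this needs no p-by-p correlation matrix in memory, which sidesteps the melt() problem entirely.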
