Binary variables (features) used in a classification model - data visualisation - r

I suppose this is less of a coding question but more of a plea for advice from experienced data scientists:
Suppose I have a classification problem at hand with 15 features (variables), all those variables are binary type - 1|0. Naturally, the output is also 1|0
I'm looking for a best / most widely used method of presenting and visualizing the output from the test data.
In the case the context might be helpful, the model is supposed to predict whether loan applicants will default on their payments after some period of time, if they're employed|unemployed had more than x days delinquency etc.
The model is randomForest

Related

How can Keras predict sequences of sales (individually) of 11106 distinct customers, each a series of varying length (anyway from 1 to 15 periods)

I am approaching a problem that Keras must offer an excellent solution for, but I am having problems developing an approach (because I am such a neophyte concerning anything for deep learning). I have sales data. It contains 11106 distinct customers, each with its time series of purchases, of varying length (anyway from 1 to 15 periods).
I want to develop a single model to predict each customer's purchase amount for the next period. I like the idea of an LSTM, but clearly, I cannot make one for each customer; even if I tried, there would not be enough data for an LSTM in any case---the longest individual time series only has 15 periods.
I have used types of Markov chains, clustering, and regression in the past to model this kind of data. I am asking the question here, though, about what type of model in Keras is suited to this type of prediction. A complication is that all customers can be clustered by their overall patterns. Some belong together based on similarity; others do not; e.g., some customers spend with patterns like $100-$100-$100, others like $100-$100-$1000-$10000, and so on.
Can anyone point me to a type of sequential model supported by Keras that might handle this well? Thank you.
I am trying to achieve this in R. Haven't been able to build a model that gives me more than about .3 accuracy.
I don't think the main difficulty is coming from which model to use as much as how to frame the problem.
As you mention, "WHO" is spending the money seems as relevant as their past transaction in knowing how much they will likely spend.
But you cannot train 10k+ models either for each customers.
Instead I would suggest clustering your customers base, and instead trying to fit a model by cluster, using all the time series combined for the customers in that cluster to train the same model.
This would allow each model to learn the spending pattern of that particular group.
For that you can use LTSM or RNN model.
Hi here's my suggestion and I will edit it later to provide you with more information
Since its a sequence problem you should use RNN based models: LSTM, GRU's

glmmLasso function won't finish running

I am trying to build a Mixed Model Lasso model using glmmLasso in RStudio. However, I am looking for some assistance.
I have the equation of my model as follows:
glmmModel <- glmmLasso(outcome ~ year + married ,list(ID=~1), lambda = 100, family=gaussian(link="identity"), data=data1,control = list(print.iter=TRUE))
where outcome is a continuous variable, year is the year the data was collected, and married is a binary indicator (1/0) of whether or not the subject is married. I eventually would like to include more covariates in my model, but for the purpose of successfully first getting this to run, right now I am just attempting to run a model with these two covariates. My data1 dataframe is 48000 observations and 57 variables.
When I click run, however, the model runs for many hours (48+) without stopping. The only feedback I am getting is "ITERATION 1," "ITERATION 2," etc... Is there something I am missing or doing wrong? Please note, I am running on a machine with only 8 GB RAM, but I don't think this should be the issue, right? My dataset (48000 observations) isn't particularly large (at least I don't think so). Any advice or thoughts would be appreciated on how I can fix this issue. Thank you!
This is too long to be a comment, but I feel like you deserve an answer to this confusion.
It is not uncommon to experience "slow" performance. In fact in many glmm implementations it is more common than not. The fact is that Generalized Linear Mixed Effect models are very hard to estimate. For purely gaussian models (no penalizer) a series of proofs gives us the REML estimator, which can be estimated very efficiently, but for generalized models this is not the case. As such note that the Random Effect model matrix can become absolutely massive. Remember that for every random effect, you obtain a block-diagonal matrix so even for small sized data, you might have a model matrix with 2000+ columns, that needs to go through optimization through PIRLS (inversions and so on).
Some packages (glmmTMB, lme4 and to some extend nlme) have very efficient implementations that abuse the block-diagonality of the random effect matrix and high-performance C/C++ libraries to perform optimized sparse-matrix calculations, while the glmmLasso (link to source) package uses R-base to perform all of it's computations. No matter how we go about it, the fact that it does not abuse sparse computations and implements it's code in R, causes it to be slow.
As a side-note, my thesis project had about 24000~ observations, with 3 random effect variables (and some odd 20 fixed effects). The fitting process of this dataset could take anywhere between 15 minutes to 3 hours, depending on the complexity, and was primarily decided by the random effect structure.
So the answer from here:
Yes glmmLasso will be slow. It may take hours, days or even weeks depending on your dataset. I would suggest using a stratified (or/and clustered) subsample across independent groups, fit the model using a smaller dataset (3000 - 4000 maybe?), to obtain initial starting points, and "hope" that these are close to the real values. Be patient. If you think neural networks are complex, welcome to the world of generalized mixed effect models.

Model with Interval and Categorical Independant and Dependant Variables

I need help with creating a model in R/choosing the right analysis.
I work for a program that trains unemployed people and eventually finds them jobs. However, we have found that many participants quit their jobs or are fired (both referred to as a negative termination) soon after beginning. I have been assigned to create a model that helps predict what factors cause these outcomes.
The model will have to be fairly complex due to the number and variety of variables. Ideally, it will take into account:
DVs
- termination type (categorical)
- time employed (interval)
IVs
- barriers (probably 3 binary categorical variables)
- number of trainings completed (interval)
- percent assigned trainings completed (interval)
- age (interval)
- gender (binary categorical)
- race/ethnicity (categorical)
I have researched various methods of analysis (particularly regression), but have yet to find one that can handle all the bases I need to cover in terms of variable diversity. I am working in R, so I would appreciate any responses to mention relevant packages or code.
Thanks so much!

Differences in scoring from PMML model on different platforms

I've built a toy Random Forest model in R (using the German Credit dataset from the caret package), exported it in PMML 4.0 and deployed onto Hadoop, using the Cascading Pattern library.
I've run into an issue where Cascading Pattern scores the same data differently (in a binary classification problem) than the same model in R. Out of 200 observations, 2 are scored differently.
Why is this? Could it be due to a difference in the implementation of Random Forests?
The German Credit dataset represents a classification-type problem. The winning score of a classification-type RF model is simply the class label that was the most frequent among member decision trees.
Suppose you have RF model with 100 decision trees, and 50 decision trees predict "good credit" and another 50 decision trees predict "bad credit". It is possible that R and Cascading Pattern resolve such tie situations differently - one picks the score that is seen first and the other picks the score that is seen last. You could try re-training your RF model with odd number of member decision trees (ie. use some value that is not divisible by two, such as 99 or 101).
The PMML specification tells to return the score that was seen first. I'm not sure if Cascading Pattern pays any attention to such details. You may want to try out an alternative solution called JPMML-Cascading.
Score matching is a big deal. When a model is moved from the scientist's desktop to the production IT deployment environment, the scores need to match. For a classification task, that also includes the probabilities of all target categories. There is sometimes a problem of precision between different implementations/platforms which can result in minimal differences (really minimal). In any case, they also need to be checked.
Obviously, it could also be the case that the model was not represented correctly in PMML ... unlikely with the R PMML Package. The other option is that the model is not deployed correctly. That is, the scoring engine cascading is using is not interpreting the PMML file properly.
PMML itself has a model element called ModelVerification that allows for a PMML file to contain scored data which can then be used for score matching. This is useful but not necessary since you should be able to score an already scored dataset and compared computed with expected results which you are already doing.
For more on model verification and score matching as well as error handling in PMML, check:
https://support.zementis.com/entries/21207918-Verifying-your-model-in-ADAPA-did-it-upload-correctly-

regressions with many nested categorical covariates

I have a few hundred thousand measurements where the dependent
variable is a probability, and would like to use logistic regression.
However, the covariates I have are all categorical, and worse, are all
nested. By this I mean that if a certain measurement has "city -
Phoenix" then obviously it is certain to have "state - Arizona" and
"country - U.S." I have four such factors - the most granular has
some 20k levels, but if need be I could do without that one, I think.
I also have a few non-nested categorical covariates (only four or so,
with maybe three different levels each).
What I am most interested in
is prediction - given a new observation in some city, I would like to
know the relevant probability/dependent variable. I am not interested
as much in the related inferential machinery - standard deviations,
etc - at least as of now. I am hoping I can afford to be sloppy.
However, I would love to have that information unless it requires
methods that are more computationally expensive.
Does anyone have any advice on how to attack this? I have looked into
mixed effects, but am not sure it is what I am looking for.
I think this is more of model design question than on R specifically; as such, I'd like to address the context of the question first then the appropriate R packages.
If your dependent variable is a probability, e.g., $y\in[0,1]$, a logistic regression is not data appropriate---particularly given that you are interested in predicting probabilities out of sample. The logistic is going to be modeling the contribution of the independent variables to the probability that your dependent variable flips from a zero to a one, and since your variable is continuous and truncated you need a different specification.
I think your latter intuition about mixed effects is a good one. Since your observations are nested, i.e., US <-> AZ <-> Phoenix, a multi-level model, or in this case a hierarchical linear model, may be the best specification for your data. The best R packages for this type of modeling are multilevel and nlme, and there is an excellent introduction to both multi-level models in R and nlme available here. You may be particularly interested in the discussion of data manipulation for multi-level modeling, which begins on page 26.
I would suggest looking into penalised regressions like the elastic net. The elastic net is used in text mining where each column represents the present or absence of a single word, and there maybe hundreds of thousands of variables, an analogous problem to yours. A good place to start with R would be the glmnet package and its accompanying JSS paper: http://www.jstatsoft.org/v33/i01/.

Resources