Predict with only one observation with randomForest in R

I am studying loan default prediction and am currently using R's "randomForest" package. My first model had an accuracy of 98%, with a sensitivity of 0.98 and a specificity of 0.97 on the test data, using the "predict" command.
The training and test sets had 2,865 and 319 observations, respectively.
In a real situation, where I would like to predict the probability of loan default for just one company, i.e. only one observation in the test data, would I have a problem?
The dataset I used contains only 8 predictor variables and 1 response variable. According to the literature, there are many more variables that should be considered. Why did I get good results with such a small set of variables? It seems "weird" to me.
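For context, this is roughly what I have in mind for the single-company case (a sketch on simulated placeholder data, not my actual loan variables):

library(randomForest)

## simulated stand-in for the loan data: 'default' is the outcome, x1/x2 are placeholder predictors
set.seed(1)
train <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
train$default <- factor(ifelse(train$x1 + 0.5 * train$x2 + rnorm(300) > 0, "yes", "no"))

fit <- randomForest(default ~ x1 + x2, data = train)

## one new company: a one-row data frame with the same columns (and factor levels) as the training data
new_company <- data.frame(x1 = 0.3, x2 = -1.2)
predict(fit, newdata = new_company, type = "prob")   # predicted class probabilities for that single observation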

Related

Forecast ARIMA using a different training set

I estimate the ARIMA model on a training dataset using the auto.arima function in R. Afterwards I use the forecast function to make, say, 50 predictions and calculate accuracy measures such as RMSE and MAE.
If I use the forecast function, it uses only the observations in the training set, and then makes the prediction at each time unit t using the values predicted at time t-1. What I am trying to do is to make one prediction at a time, adding at each time t an observed value to the training set, without re-estimating the ARIMA model. So instead of considering the predicted values at time t-1, I would consider the real values. So if the ARIMA model has been estimated on a training dataset of 100 observations, the first forecast will be based on a training set of length 100, the second forecast on a training set of length 101, the third forecast on a training set of length 102, and so on.
The auto.arima output contains the dataset "x", which is the training set I use to estimate the model, and the dataset "fitted", which contains the fitted values. It also has the element "nobs", which is the length of "x". I am trying to replace auto.arima$x with a new training dataset whose last observations are the true values I add one at a time. I also modify "nobs" so it gives me the length of the new "x". But I noticed that the one-step-ahead forecast always uses the old training set. So, for instance, I added one observed value at a time to the training set and made the one-step-ahead prediction 50 times, but all the predictions are equal to the first one. It's as if the forecast function ignores the fact that I replaced the "x" series inside the auto.arima output. I tried replacing the "fitted" values as well, with the same result.
Does someone know how exactly the forecast function determines the training set on which to base the predictions? What should I modify inside the auto.arima output at each time t to get the one-step-ahead predictions based on the real values at the previous times, instead of the estimated ones? Or is there a way to tell the forecast function to consider a different training dataset?
I don't want to refit the ARIMA model on the new training dataset (using the Arima function) and re-estimate the residual variance; it takes literally forever...
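For reference, this is the rolling scheme I have in mind, sketched on a simulated series (this assumes the forecast package's Arima() can take a previously fitted model via its model argument, which reuses the estimated coefficients instead of re-estimating them):

library(forecast)

set.seed(1)
y <- arima.sim(list(ar = 0.6), n = 150)   # simulated series standing in for my data
fit <- auto.arima(window(y, end = 100))   # estimate the model once, on the first 100 observations

## one-step-ahead forecasts for t = 101..150, each time feeding in the true history up to t-1
preds <- numeric(50)
for (i in 1:50) {
  history  <- window(y, end = 100 + i - 1)      # training set plus the observed values up to time t-1
  refit    <- Arima(history, model = fit)       # reuse the coefficients from 'fit', no re-estimation
  preds[i] <- forecast(refit, h = 1)$mean       # forecast for time t based on the real past values
}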
Any suggestion would be helpful
Thank you in advance

predict on test data in random forest survival analysis

I'm using predict on a real-data cohort with a model trained by the randomForestSRC package in R. The real-data cohort has missing values, whereas the training set that produced the model does not.
pred_cohort <- predict(model, cohort, na.action = "na.impute")
Since the cohort is a small set (only 8 observations), the factors have fewer levels than in the training set that produced the model. I realized by coincidence that if I set the levels of the real-data cohort to be the levels of the model (see the code example below), I get different predictions than if I do not. Why is that?
levels(cohort$var1) <- levels(model$xvar$var1)
I also realized that the imputed values for the missing cells are different if I force the levels of the real data to be the levels of the model (according to the code above) than if I leave the levels as they are.
The question is: is this a bug? If not, which way is preferable, and why?

Imbalanced training dataset and regression model

I have a large dataset (>300,000 observations) that represent the distance (RMSD) between proteins. I'm building a regression model (Random Forest) that is supposed to predict the distance between any two proteins.
My problem is that I'm more interested in close matches (short distances), but my data distribution is highly skewed: the majority of the distances are large. I don't really care how well the model predicts large distances, so I want to make sure it predicts the distances of close matches as accurately as possible. At the same time, I don't want to stratify the data too much, since unfortunately this skewed distribution reflects the real-world distribution on which I am going to validate and test the model.
The following is my data distribution, where the first column is the distance and the second column is the number of observations in that distance range:
Distance Observations
0 330
1 1903
2 12210
3 35486
4 54640
5 62193
6 60728
7 47874
8 33666
9 21640
10 12535
11 6592
12 3159
13 1157
14 349
15 86
16 12
The first thing I would try here is building a regression model of the log of the distance, since this will concentrate the range of larger distances. If you're using a generalised linear model this is the log link function; for other methods you could just manually do this by estimating a regression function of your inputs, x, and exponentiating the result:
y = exp( f(x) )
Remember to train on the log of the distance for each pair.
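For instance, a minimal sketch with randomForest on simulated placeholder data (log1p/expm1 are used instead of log/exp only to guard against zero distances):

library(randomForest)

## toy stand-in for the protein data: 'dist' plays the role of the RMSD,
## x1 and x2 are placeholder predictors (not real features)
set.seed(1)
dat <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
dat$dist <- pmax(0, 5 + 2 * dat$x1 - dat$x2 + rnorm(500))

dat$log_dist <- log1p(dat$dist)                    # train on the log scale; log1p guards against dist == 0
fit <- randomForest(log_dist ~ x1 + x2, data = dat)

pred_dist <- expm1(predict(fit, newdata = dat))    # back-transform predictions to the distance scale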
Popular techniques for dealing with imbalanced distribution in regression include:
Random over/under-sampling (see the sketch after this list).
Synthetic Minority Oversampling Technique for Regression (SMOTER), which has an R package implementation.
The Weighted Relevance-based Combination Strategy (WERCS), which has a GitHub repository of R code implementing it.
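A minimal sketch of the first option, reusing the toy dat from the sketch above (the distance cut-off of 3 and the 2:1 ratio are arbitrary placeholders):

## simple random under-sampling: keep every close match, down-sample the abundant long distances
set.seed(2)
rare     <- dat[dat$dist <  3, ]
common   <- dat[dat$dist >= 3, ]
common   <- common[sample(nrow(common), size = min(nrow(common), 2 * nrow(rare))), ]
balanced <- rbind(rare, common)

fit_bal <- randomForest(log_dist ~ x1 + x2, data = balanced)   # refit on the rebalanced data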
PS: The table you show makes it look like you have a classification problem rather than a regression problem.
As previously mentioned, I think what might help you given your problem is Synthetic Minority Over-Sampling Technique for Regression (SMOTER).
If you're a Python user, I'm currently working to improve my implementation of the SMOGN algorithm, a variant of SMOTER. https://github.com/nickkunz/smogn
Also, there are a few examples on Kaggle that have applied SMOGN to improve their prediction results. https://www.kaggle.com/aleksandradeis/regression-addressing-extreme-rare-cases

multinomial logistic multilevel models in R

Problem: I need to estimate a set of multinomial logistic multilevel models and can’t find an appropriate R package. What is the best R package to estimate such models? STATA 13 recently added this feature to their multilevel mixed-effects models – so the technology to estimate such models seems to be available.
Details: A number of research questions require the estimation of multinomial logistic regression models in which the outcome variable is categorical. For example, biologists might be interested to investigate which type of trees (e.g., pine trees, maple trees, oak trees) are most impacted by acid rain. Market researchers might be interested whether there is a relationship between the age of customers and the frequency of shopping at Target, Safeway, or Walmart. These cases have in common that the outcome variable is categorical (unordered) and multinomial logistic regressions are the preferred method of estimation. In my case, I am investigating differences in types of human migration, with the outcome variable (mig) coded 0=not migrated, 1=internal migration, 2=international migration. Here is a simplified version of my data set:
migDat <- data.frame(hhID = 1:21,
                     mig = rep(0:2, times = 7),
                     age = ceiling(runif(21, 15, 90)),
                     stateID = rep(letters[1:3], each = 7),
                     pollution = rep(c("high", "low", "moderate"), each = 7),
                     stringsAsFactors = FALSE)
hhID mig age stateID pollution
1 1 0 47 a high
2 2 1 53 a high
3 3 2 17 a high
4 4 0 73 a high
5 5 1 24 a high
6 6 2 80 a high
7 7 0 18 a high
8 8 1 33 b low
9 9 2 90 b low
10 10 0 49 b low
11 11 1 42 b low
12 12 2 44 b low
13 13 0 82 b low
14 14 1 70 b low
15 15 2 71 c moderate
16 16 0 18 c moderate
17 17 1 18 c moderate
18 18 2 39 c moderate
19 19 0 35 c moderate
20 20 1 74 c moderate
21 21 2 86 c moderate
My goal is to estimate the impact of age (independent variable) on the odds of (1) migrating internally vs. not migrating, (2) migrating internationally vs. not migrating, (3) migrating internally vs. migrating internationally. An additional complication is that my data operate at different aggregation levels (e.g., pollution operates at the state-level) and I am also interested in predicting the impact of air pollution (pollution) on the odds of embarking on a particular type of movement.
Clunky solutions: One could estimate a set of separate logistic regression models by reducing the dataset for each model to only two migration types (e.g., Model 1: only cases coded mig=0 and mig=1; Model 2: only cases coded mig=0 and mig=2; Model 3: only cases coded mig=1 and mig=2). Such a simple multilevel logistic regression model could be estimated with lme4, but this approach is less than ideal because it does not appropriately account for the impact of the omitted cases. A second solution would be to run multinomial logistic multilevel models in MLwiN through R using the R2MLwiN package. But since MLwiN is not open source and the generated object is difficult to use, I would prefer to avoid this option. Based on a comprehensive internet search there seems to be some demand for such models, but I am not aware of a good R package. So it would be great if some experts who have run such models could provide a recommendation, and if there is more than one package, maybe indicate some advantages/disadvantages. I am sure that such information would be a very helpful resource for multiple R users. Thanks!!
Best,
Raphael
There are generally two ways of fitting a multinomial model of a categorical variable with J groups: (1) simultaneously estimating J-1 contrasts; (2) estimating a separate logit model for each contrast.
Do these two methods produce the same results? No, but the results are often similar.
Which method is better? Simultaneous fitting is more precise (see below for an explanation of why).
Why would someone use separate logit models then? (1) The lme4 package has no routine for simultaneously fitting multinomial models, and there is no other multilevel R package that can do this. So separate logit models are presently the only practical solution if someone wants to estimate multilevel multinomial models in R. (2) As some powerful statisticians have argued (Begg and Gray, 1984; Allison, 1984, p. 46-47), separate logit models are much more flexible as they permit the independent specification of the model equation for each contrast.
Is it legitimate to use separate logit models? Yes, with some disclaimers. This method is called the “Begg and Gray Approximation”. Begg and Gray (1984, p. 16) showed that this “individualized method is highly efficient”. However, there is some efficiency loss and the Begg and Gray Approximation produces larger standard errors (Agresti 2002, p. 274). As such, it is more difficult to obtain significant results with this method and the results can be considered conservative. This efficiency loss is smallest when the reference category is large (Begg and Gray, 1984; Agresti 2002). R packages that employ the Begg and Gray Approximation (not multilevel) include mlogitBMA (Sevcikova and Raftery, 2012).
Why is a series of individual logit models imprecise?
In my initial example we have a variable (migration) that can have three values A (no migration), B (internal migration), C (international migration). With only one predictor variable x (age), multinomial models are parameterized as a series of binomial contrasts as follows (Long and Cheng, 2004 p. 277):
Eq. 1: Ln(Pr(B|x)/Pr(A|x)) = b0,B|A + b1,B|A (x)
Eq. 2: Ln(Pr(C|x)/Pr(A|x)) = b0,C|A + b1,C|A (x)
Eq. 3: Ln(Pr(B|x)/Pr(C|x)) = b0,B|C + b1,B|C (x)
For these contrasts the following equations must hold:
Eq. 4: Ln(Pr(B|x)/Pr(A|x)) - Ln(Pr(C|x)/Pr(A|x)) = Ln(Pr(B|x)/Pr(C|x))
Eq. 5: b0,B|A - b0,C|A = b0,B|C
Eq. 6: b1,B|A - b1,C|A = b1,B|C
The problem is that these equations (Eq. 4-6) will in practice not hold exactly, because the coefficients are estimated on slightly different samples: only cases from the two contrasting groups are used, and cases from the third group are omitted. Programs that simultaneously estimate the multinomial contrasts make sure that Eq. 4-6 hold (Long and Cheng, 2004 p. 277). I don't know exactly how this "simultaneous" model solving works (maybe someone can provide an explanation?). Software that does simultaneous fitting of multilevel multinomial models includes MLwiN (Steele 2013, p. 4) and STATA (xlmlogit command, Pope, 2014).
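To make the separate-logit approach concrete, here is a rough sketch with lme4 on the migDat example from my initial question (purely illustrative; the contrasts follow Eq. 1-3 above, and a state-level covariate such as pollution would simply be added to the fixed effects):

library(lme4)

## Begg and Gray style approximation: one binomial (multilevel) logit per contrast
## (with only 21 toy rows and 3 states these fits are degenerate; the point is the structure of the calls)

## contrast 1: internal migration (1) vs. no migration (0)
m.1v0 <- glmer(I(mig == 1) ~ age + (1 | stateID),
               data = subset(migDat, mig %in% c(0, 1)), family = binomial)

## contrast 2: international migration (2) vs. no migration (0)
m.2v0 <- glmer(I(mig == 2) ~ age + (1 | stateID),
               data = subset(migDat, mig %in% c(0, 2)), family = binomial)

## contrast 3: internal (1) vs. international (2) migration
m.1v2 <- glmer(I(mig == 1) ~ age + (1 | stateID),
               data = subset(migDat, mig %in% c(1, 2)), family = binomial)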
References:
Agresti, A. (2002). Categorical data analysis (2nd ed.). Hoboken, NJ: John Wiley & Sons.
Allison, P. D. (1984). Event history analysis. Thousand Oaks, CA: Sage Publications.
Begg, C. B., & Gray, R. (1984). Calculation of polychotomous logistic regression parameters using individualized regressions. Biometrika, 71(1), 11-18.
Long, S. J., & Cheng, S. (2004). Regression models for categorical outcomes. In M. Hardy & A. Bryman (Eds.), Handbook of data analysis (pp. 258-285). London: SAGE Publications, Ltd.
Pope, R. (2014). In the spotlight: Meet Stata's new xlmlogit command. Stata News, 29(2), 2-3.
Sevcikova, H., & Raftery, A. (2012). Estimation of multinomial logit model using the Begg & Gray approximation.
Steele, F. (2013). Module 10: Single-level and multilevel models for nominal responses concepts. Bristol, U.K,: Centre for Multilevel Modelling.
An older question, but I think a viable option that has recently emerged is brms, which uses the Bayesian Stan program to actually run the model. For example, if you want to run a multinomial logistic regression on the iris data:
b1 <- brm(Species ~ Petal.Length + Petal.Width + Sepal.Length + Sepal.Width,
          data = iris, family = "categorical",
          prior = c(set_prior("normal(0, 8)")))
And to get an ordinal regression -- not appropriate for iris, of course -- you'd switch the family="categorical" to family="acat" (or cratio or sratio, depending on the type of ordinal regression you want) and make sure that the dependent variable is ordered.
Clarification per Raphael's comment: This brm call compiles your formula and arguments into Stan code. Stan compiles it into C++ and uses your system's C++ compiler -- which is required. On a Mac, for example, you may need to install the free Developer Tools to get C++. Not sure about Windows. Linux should have C++ installed by default.
Clarification per Qaswed's comment: brms easily handles multilevel models as well, using the R formula syntax (1 | groupvar) to add a (random) intercept for a group, (1 + foo | groupvar) to add a random intercept and slope, etc.
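For example, a sketch of a multilevel (random-intercept) version using the migDat example from the question (purely illustrative; the outcome is converted to a factor first, and 21 rows are of course far too few for a serious fit):

library(brms)

migDat$mig_f <- factor(migDat$mig, labels = c("none", "internal", "international"))

## multinomial (categorical) model with a random intercept for state
b2 <- brm(mig_f ~ age + (1 | stateID),
          data = migDat, family = categorical())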
I'm puzzled that this technique is described as "standard" and "equivalent", though it might well be a good practical solution. (I guess I'd better check out the Allison and Dobson & Barnett references.)
For the simple multinomial case (no clusters, repeated measures, etc.), Begg and Gray (1984) propose using k-1 binomial logits against a reference category as an approximation (though in many cases a good one) to the full-blown multinomial logit. They demonstrate some loss of efficiency when using a single reference category, though it's small for cases where a single high-frequency baseline category is used as the reference.
Agresti (2002: p. 274) provides an example where there is a small increase in standard errors even when the baseline category constitutes over 70% of 219 cases in a five category example.
Maybe it's no big deal, but I don't see how the approximation would get any better adding a second layer of randomness.
References
Agresti, A. (2002). Categorical data analysis. Hoboken NJ: Wiley.
Begg, C. B., & Gray, R. (1984). Calculation of polychotomous logistic regression parameters using individualized regressions. Biometrika, 71(1), 11–18.
I would recommend the "mlogit" package.
I am dealing with the same issue, and one possible solution I found is to resort to the Poisson (log-linear/count) equivalent of the multinomial logistic model, as described in this mailing list, these nice slides, or in Agresti (2013: 353-356). Thus, it should be possible to use glmer(..., family = poisson) from the lme4 package with some aggregation of the data.
Reference:
Agresti, A. (2013) Categorical data analysis. Hoboken, NJ: Wiley.
Since I had the same problem, I recently came across this question. I found a package called ordinal that has a cumulative link mixed model function (clmm2), which seems similar to the proposed brm function but uses a frequentist approach.
Basically, you need to set the link function (for instance, logit); you can choose to have nominal variables (meaning variables that do not fulfil the proportional odds assumption); you can set the threshold to "flexible" if you want to allow unstructured cut-points; and finally you add the argument "random" to specify any variable that you would like to model with a random effect.
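A minimal sketch of such a call, using the wine data that ships with the ordinal package (I leave out the nominal argument here; see the package documentation for that and the other options described above):

library(ordinal)
data(wine)   # example data included in the ordinal package

## cumulative link mixed model: logit link, flexible (unstructured) thresholds,
## and a random effect for each judge
fm <- clmm2(rating ~ temp + contact, random = judge,
            data = wine, link = "logistic", threshold = "flexible",
            Hess = TRUE)
summary(fm)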
I also found the book Multilevel Modeling Using R by W. Holmes Finch, Jocelyn E. Bolin, and Ken Kelley, which illustrates how to use the function starting on page 151, with nice examples.
Here's an implementation (not my own). I'd just work off this code. Plus, this way you'll really know what's going on under the hood.
http://www.nhsilbert.net/docs/rcode/multilevel_multinomial_logistic_regression.R

R Zeroinfl model

I am carrying out a zero-inflated negative binomial GLM on some insect count data in R. My problem is how to get R to read my species data as one stacked column so as to preserve the zero inflation. If I subtotal and import it into R as a single row titled Abundance, I lose the zeros and the model doesn't work. So far, I have tried to:
stack the data myself (there are 80 columns * 47 rows), so with 3,760 rows after stacking manually you can imagine how slow R gets when using the pscl zeroinfl() command (it takes 20 minutes on my computer!, but it still worked)
The next problem concerns spatial correlation: certain samplers sampled from the same medium, which violates independence. Can I just put medium in as a factor in the model?
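For reference, a sketch of the kind of stacking I mean, on made-up data (melt() from reshape2 does the stacking and keeps the zeros; medium is carried along so it can enter the model as a factor):

library(reshape2)
library(pscl)

## made-up stand-in for the insect data: one row per sample,
## one column per species (sp1..sp3), plus the sampling medium
set.seed(1)
wide <- data.frame(site   = 1:30,
                   medium = factor(rep(c("soil", "litter", "bark"), 10)),
                   sp1 = rpois(30, 1), sp2 = rpois(30, 0.5), sp3 = rpois(30, 2))

## stack the species columns into a single Abundance column, keeping the zeros
long <- melt(wide, id.vars = c("site", "medium"),
             variable.name = "species", value.name = "Abundance")

## zero-inflated negative binomial; medium enters the count part as a plain factor,
## and the zero-inflation part is kept intercept-only here for simplicity
fit <- zeroinfl(Abundance ~ species + medium | 1, data = long, dist = "negbin")
summary(fit)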
3,760 rows take 20 minutes with pscl? My god, I have battled 30,000 rows :) that's why my pscl calculation did not finish...
However, I then worked with a GLMM including nested random effects (lme/gamm) and a negative binomial distribution, setting theta to a low value so that the distribution handles the zero inflation. I think this depends on the degree of zero inflation; in my case it was 44% and the residuals looked rather good.
