Imbalanced training dataset and regression model - r

I have a large dataset (>300,000 observations) representing the distance (RMSD) between pairs of proteins. I'm building a regression model (Random Forest) that is supposed to predict the distance between any two proteins.
My problem is that I'm mostly interested in close matches (short distances), but my data distribution is highly skewed: the majority of the distances are large. I don't really care how well the model predicts large distances; I want to make sure it predicts the distances of close matches as accurately as possible. However, when I train the model on the full data its performance isn't good. So I wonder what the best sampling strategy is to ensure the model predicts close-match distances as accurately as possible, while not stratifying the data too heavily, since this skewed distribution unfortunately reflects the real-world data I'm going to validate and test the model on.
The following is my data distribution, where the first column represents the distance range and the second column the number of observations in that range:
Distance Observations
0 330
1 1903
2 12210
3 35486
4 54640
5 62193
6 60728
7 47874
8 33666
9 21640
10 12535
11 6592
12 3159
13 1157
14 349
15 86
16 12

The first thing I would try here is building a regression model of the log of the distance, since this will compress the range of the larger distances. If you're using a generalised linear model this is the log link function; for other methods you could just do this manually by estimating a regression function f of your inputs x and exponentiating the result:
y = exp( f(x) )
Remember to train on the log of the distance for each pair.
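A minimal sketch of that idea in R (assumptions not in the question: a predictor matrix X and a distance vector rmsd; randomForest is just one possible learner; log1p/expm1 are used so that zero distances are handled):
library(randomForest)
fit <- randomForest(x = X, y = log1p(rmsd))    ## train on the log scale (log1p handles rmsd == 0)
pred_rmsd <- expm1(predict(fit, newdata = X))  ## back-transform predictions to the distance scale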

Popular techniques for dealing with imbalanced distribution in regression include:
Random over/under-sampling (a minimal sketch of under-sampling follows below).
Synthetic Minority Oversampling Technique for Regression (SMOTER), which has an R package implementation.
The Weighted Relevance-based Combination Strategy (WERCS), which has a GitHub repository of R code implementing it.
PS: The table you show seems like you have a classification problem and not a regression problem.
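Here is a hedged sketch of the first option, simple random under-sampling of the abundant large-distance cases (the data frame df, the rmsd column, and the cutoff of 5 are placeholders, not from the question):
set.seed(1)
rare     <- df[df$rmsd < 5, ]                            ## the region of interest (close matches)
common   <- df[df$rmsd >= 5, ]
common_s <- common[sample(nrow(common), nrow(rare)), ]   ## downsample the abundant large distances
train_bal <- rbind(rare, common_s)
library(randomForest)
fit_bal <- randomForest(rmsd ~ ., data = train_bal)      ## then evaluate on the untouched, real-world distribution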

As previously mentioned, I think what might help you given your problem is Synthetic Minority Over-Sampling Technique for Regression (SMOTER).
If you're a Python user, I'm currently working to improve my implementation of the SMOGN algorithm, a variant of SMOTER. https://github.com/nickkunz/smogn
Also, there are a few examples on Kaggle that have applied SMOGN to improve their prediction results. https://www.kaggle.com/aleksandradeis/regression-addressing-extreme-rare-cases

Related

Extracting linear term from a polynomial predictor in a GLM

I am relatively new to both R and Stack overflow so please bear with me. I am currently using GLMs to model ecological count data under a negative binomial distribution in brms. Here is my general model structure, which I have chosen based on fit, convergence, low LOOIC when compared to other models, etc:
My goal is to characterize population trends of study organisms over the study period. I have created marginal effects plots by using the model to predict on a new dataset where all covariates are constant except year (shaded areas are 80% and 95% credible intervals for posterior predicted means):
I am now hoping to extract trend magnitudes that I can report and compare across species (i.e. say a certain species declined or increased by x% (+/- y%) per year). Because I use poly() in the model, my understanding is that R uses orthogonal polynomials, and the resulting polynomial coefficients are not easily interpretable. I have tried generating raw polynomials (setting raw=TRUE in poly()), which I thought would produce the same fit and have directly interpretable coefficients. However, the resulting models don't really run (after 5 hours neither chain gets through even a single iteration, whereas the same model with raw=FALSE only takes a few minutes to run). Very simplified versions of the model (e.g. count ~ poly(year, 2, raw=TRUE)) do run, but take several orders of magnitude longer than setting raw=FALSE, and the resulting model also predicts different counts than the model with orthogonal polynomials. My questions are (1) what is going on here? and (2) more broadly, how can I feasibly extract the linear term of the quartic polynomial describing response to year, or otherwise get at a value corresponding to population trend?
I feel like this should be relatively simple and I apologize if I'm overlooking something obvious. Please let me know if there is further code I should share for more clarity; I didn't want to make the initial post overly long, but I'm happy to show specific predictions from different models or anything else. Thank you for any help.
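As a hedged aside (toy data and plain lm rather than brms; every name here is invented for illustration): orthogonal and raw polynomials are reparameterizations of the same curve, so the fitted values agree even though the coefficients look completely different, and the raw parameterization is far more poorly scaled and collinear, which is likely why the raw=TRUE model crawls in Stan.
set.seed(1)
d <- data.frame(year = 2004:2023)
d$count <- rpois(nrow(d), exp(3 + 0.1 * scale(d$year)))  ## toy counts with a mild trend
d$yr <- d$year - mean(d$year)                            ## centring keeps the raw fit numerically stable
f_orth <- lm(log(count + 1) ~ poly(yr, 2), data = d)
f_raw  <- lm(log(count + 1) ~ poly(yr, 2, raw = TRUE), data = d)
all.equal(fitted(f_orth), fitted(f_raw))                 ## TRUE: identical fitted curve
coef(f_orth)                                             ## orthogonal coefficients
coef(f_raw)                                              ## raw coefficients: same curve, different numbers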

How to calculate Bias and Variance for SVM and Random Forest Model

I'm working on a classification problem (predicting three classes) and I'm comparing SVM against Random Forest in R.
For evaluation and comparison I want to calculate the bias and variance of the models. I've looked up the two terms in many machine learning books, and I'd say I understand the gist of bias and variance (the bullseye picture is the easiest explanation). But I can't really figure out how to apply it in my case.
Let's say I predict the results for a test set with 4 SVM-models that were trained with 4 different training sets. Each time I get a total error (meaning all wrong predictions/all predictions).
Do I then get the bias for SVM by calculating this?
which would mean that the bias is more or less the mean of the errors?
I hope you can help me with a not-too-complicated formula, because I've already seen many of them.
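In case it helps, here is a hedged sketch of one common empirical approach (all choices, including the iris data and randomForest, are placeholders): train the model on several bootstrap resamples of the training data, take the majority vote across the resampled models as the "main prediction" for each test case, and then read bias as disagreement between the main prediction and the truth, and variance as disagreement of the individual models with the main prediction.
library(randomForest)
set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]
B <- 25
preds <- replicate(B, {
  boot <- train[sample(nrow(train), replace = TRUE), ]    ## one bootstrap training set
  as.character(predict(randomForest(Species ~ ., data = boot), test))
})
main <- apply(preds, 1, function(p) names(which.max(table(p))))  ## majority vote per test case
bias_01  <- mean(main != as.character(test$Species))     ## how often the central tendency is wrong
variance <- mean(preds != matrix(main, nrow(test), B))   ## how much single models deviate from it
c(bias = bias_01, variance = variance)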

Inconsistency between confusion matrix and classified image

Due to computational limitations with my GIS software, I am trying to implement random forests in R for image classification purposes. My input is a multi-band TIFF image, which is being trained on an ArcGIS shapefile (target values 0 and 1). The code technically works and produces a valid output. When I view the confusion matrix I get the following:
0 1 class.error
0 11 3 0.214285714
1 1 13 0.071428571
This is sensible for my data. However when I plot up the output of the image classification in my GIS software (the binary reclassified tiff with values 0 and 1), it predicts the training data with a 100% success rate. In other words there is no classification error with the output image. How is this the case when the confusion matrix indicates there are classification errors?
Am I missing something really obvious here? Code snippet below.
rf.mdl <- randomForest(x = samples@data[, names(PredMaps)],  # predictor values at the training points
                       y = samples@data[, ValueFld],         # target class (0/1)
                       ntree = 501, proximity = TRUE, importance = TRUE,
                       keep.forest = TRUE, keep.inbag = TRUE)
ConfMat <- rf.mdl$confusion                                   # OOB confusion matrix
write.csv(ConfMat, file = "ConfMat1.csv")
predict(PredMaps, rf.mdl, filename = classifiedPath, type = "response",
        na.rm = TRUE, overwrite = TRUE, progress = "text")
I expected the output classified image to misclassify 1 of the Value=1 training points and misclassify 3 of the Value=0 training points based on what is indicated in the confusion matrix.
The Random Forest algorithm is a bagging method. This means it creates numerous weak classifiers, then has each weak classifier "vote" to create the end prediction. In RF, each weak classifier is one decision tree that is trained on a random sample of observations in the training set. Think of the random samples each decision tree is trained on as a "bag" of data.
What is being shown in the confusion matrix is something called "out-of-bag error" (OOB error). This OOB error is an accurate estimate of how your model would generalize to data it has never seen before (this estimate is usually achieved by testing your model on a withheld testing set). Since each decision tree is trained on only one bag from your training data, the rest of the data (data that's "outside the bag") can stand in for this withheld data.
OOB error is calculated by making a prediction for each observation in the training set. However, when predicting each individual observation, only decision trees whose bags did not include that observation are allowed to participate in the voting process. The result is the confusion matrix available after training a RF model.
When you predict the observations in the training set using the complete model, decision trees whose bags did include each observation are now involved in the voting process. Since these decision trees "remember" the observation they were trained on, they skew the prediction toward the correct answer. This is why you achieve 100% accuracy.
Essentially, you should trust the confusion matrix that uses OOB error. It's a robust estimate of how the model will generalize to unseen data.
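A quick hedged illustration of this difference on a standard dataset (iris here, not the poster's rasters): the OOB confusion matrix reports honest errors, while re-predicting the training data with the full forest is resubstitution and is typically near-perfect.
library(randomForest)
set.seed(42)
rf <- randomForest(Species ~ ., data = iris)
rf$confusion                           ## OOB-based confusion matrix (honest estimate)
table(predicted = predict(rf, iris),   ## resubstitution: every tree votes, including those
      actual    = iris$Species)        ## trained on that observation; usually ~100% correct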

multinomial logistic multilevel models in R

Problem: I need to estimate a set of multinomial logistic multilevel models and can’t find an appropriate R package. What is the best R package to estimate such models? STATA 13 recently added this feature to their multilevel mixed-effects models – so the technology to estimate such models seems to be available.
Details: A number of research questions require the estimation of multinomial logistic regression models in which the outcome variable is categorical. For example, biologists might be interested to investigate which type of trees (e.g., pine trees, maple trees, oak trees) are most impacted by acid rain. Market researchers might be interested whether there is a relationship between the age of customers and the frequency of shopping at Target, Safeway, or Walmart. These cases have in common that the outcome variable is categorical (unordered) and multinomial logistic regressions are the preferred method of estimation. In my case, I am investigating differences in types of human migration, with the outcome variable (mig) coded 0=not migrated, 1=internal migration, 2=international migration. Here is a simplified version of my data set:
migDat <- data.frame(hhID      = 1:21,
                     mig       = rep(0:2, times = 7),
                     age       = ceiling(runif(21, 15, 90)),
                     stateID   = rep(letters[1:3], each = 7),
                     pollution = rep(c("high", "low", "moderate"), each = 7),
                     stringsAsFactors = FALSE)
hhID mig age stateID pollution
1 1 0 47 a high
2 2 1 53 a high
3 3 2 17 a high
4 4 0 73 a high
5 5 1 24 a high
6 6 2 80 a high
7 7 0 18 a high
8 8 1 33 b low
9 9 2 90 b low
10 10 0 49 b low
11 11 1 42 b low
12 12 2 44 b low
13 13 0 82 b low
14 14 1 70 b low
15 15 2 71 c moderate
16 16 0 18 c moderate
17 17 1 18 c moderate
18 18 2 39 c moderate
19 19 0 35 c moderate
20 20 1 74 c moderate
21 21 2 86 c moderate
My goal is to estimate the impact of age (independent variable) on the odds of (1) migrating internally vs. not migrating, (2) migrating internationally vs. not migrating, (3) migrating internally vs. migrating internationally. An additional complication is that my data operate at different aggregation levels (e.g., pollution operates at the state-level) and I am also interested in predicting the impact of air pollution (pollution) on the odds of embarking on a particular type of movement.
Clunky solutions: One could estimate a set of separate logistic regression models by reducing the data set for each model to only two migration types (e.g., Model 1: only cases coded mig=0 and mig=1; Model 2: only cases coded mig=0 and mig=2; Model 3: only cases coded mig=1 and mig=2). Such a simple multilevel logistic regression model could be estimated with lme4, but this approach is less than ideal because it does not appropriately account for the impact of the omitted cases. A second solution would be to run multinomial logistic multilevel models in MLWiN through R using the R2MLwiN package. But since MLWiN is not open source and the generated objects are difficult to use, I would prefer to avoid this option. Based on a comprehensive internet search there seems to be some demand for such models, but I am not aware of a good R package. So it would be great if some experts who have run such models could provide a recommendation, and if there is more than one package, maybe indicate some advantages/disadvantages. I am sure that such information would be a very helpful resource for many R users. Thanks!!
Best,
Raphael
There are generally two ways of fitting a multinomial model of a categorical variable with J groups: (1) simultaneously estimating J-1 contrasts; (2) estimating a separate logit model for each contrast.
Do these two methods produce the same results? No, but the results are often similar.
Which method is better? Simultaneous fitting is more precise (see below for an explanation of why).
Why would someone use separate logit models then? (1) The lme4 package has no routine for simultaneously fitting multinomial models, and there is no other multilevel R package that can do this. So separate logit models are presently the only practical solution if someone wants to estimate multilevel multinomial models in R. (2) As some powerful statisticians have argued (Begg and Gray, 1984; Allison, 1984, pp. 46-47), separate logit models are much more flexible, as they permit independent specification of the model equation for each contrast.
Is it legitimate to use separate logit models? Yes, with some disclaimers. This method is called the “Begg and Gray Approximation”. Begg and Gray (1984, p. 16) showed that this “individualized method is highly efficient”. However, there is some efficiency loss and the Begg and Gray Approximation produces larger standard errors (Agresti 2002, p. 274). As such, it is more difficult to obtain significant results with this method and the results can be considered conservative. This efficiency loss is smallest when the reference category is large (Begg and Gray, 1984; Agresti 2002). R packages that employ the Begg and Gray Approximation (not multilevel) include mlogitBMA (Sevcikova and Raftery, 2012).
Why is a series of individual logit models imprecise?
In my initial example we have a variable (migration) that can have three values A (no migration), B (internal migration), C (international migration). With only one predictor variable x (age), multinomial models are parameterized as a series of binomial contrasts as follows (Long and Cheng, 2004 p. 277):
Eq. 1: Ln(Pr(B|x)/Pr(A|x)) = b0,B|A + b1,B|A (x)
Eq. 2: Ln(Pr(C|x)/Pr(A|x)) = b0,C|A + b1,C|A (x)
Eq. 3: Ln(Pr(B|x)/Pr(C|x)) = b0,B|C + b1,B|C (x)
For these contrasts the following equations must hold:
Eq. 4: Ln(Pr(B|x)/Pr(A|x)) - Ln(Pr(C|x)/Pr(A|x)) = Ln(Pr(B|x)/Pr(C|x))
Eq. 5: b0,B|A - b0,C|A = b0,B|C
Eq. 6: b1,B|A - b1,C|A = b1,B|C
The problem is that these equations (Eq. 4-6) will in practice not hold exactly, because the coefficients are estimated on slightly different samples: only cases from the two contrasting groups are used, and cases from the third group are omitted. Programs that simultaneously estimate the multinomial contrasts make sure that Eq. 4-6 hold (Long and Cheng, 2004 p. 277). I don't know exactly how this "simultaneous" model solving works (maybe someone can provide an explanation?). Software that does simultaneous fitting of multilevel multinomial models includes MLwiN (Steele 2013, p. 4) and STATA (xlmlogit command, Pope, 2014).
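To make this concrete, here is a hedged single-level sketch (simulated data; nnet::multinom as the simultaneous fitter; the names dat, age, and the coefficients are invented for illustration) showing that Eq. 4-6 hold exactly for the simultaneous fit but only approximately for separately fitted logits:
library(nnet)
set.seed(1)
n   <- 300
dat <- data.frame(age = runif(n, 15, 90))
## simulate a 3-category outcome (A, B, C) whose log-odds depend on age
lpB <- -1 + 0.02 * dat$age
lpC <- -2 + 0.04 * dat$age
p   <- cbind(1, exp(lpB), exp(lpC))
p   <- p / rowSums(p)
dat$mig <- factor(apply(p, 1, function(pr) sample(c("A", "B", "C"), 1, prob = pr)))
## simultaneous fit: both contrasts (B|A and C|A) estimated together
m_sim <- multinom(mig ~ age, data = dat, trace = FALSE)
## separate Begg-and-Gray style binary logits, one per contrast
m_BA <- glm(mig == "B" ~ age, data = subset(dat, mig != "C"), family = binomial)
m_CA <- glm(mig == "C" ~ age, data = subset(dat, mig != "B"), family = binomial)
m_BC <- glm(mig == "B" ~ age, data = subset(dat, mig != "A"), family = binomial)
coef(m_sim)["B", ] - coef(m_sim)["C", ]                   ## equals the B|C contrast exactly (Eq. 4-6)
rbind(BA = coef(m_BA), CA = coef(m_CA), BC = coef(m_BC))  ## separate fits: BA - CA only roughly equals BC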
References:
Agresti, A. (2002). Categorical data analysis (2nd ed.). Hoboken, NJ: John Wiley & Sons.
Allison, P. D. (1984). Event history analysis. Thousand Oaks, CA: Sage Publications.
Begg, C. B., & Gray, R. (1984). Calculation of polychotomous logistic regression parameters using individualized regressions. Biometrika, 71(1), 11-18.
Long, S. J., & Cheng, S. (2004). Regression models for categorical outcomes. In M. Hardy & A. Bryman (Eds.), Handbook of data analysis (pp. 258-285). London: SAGE Publications, Ltd.
Pope, R. (2014). In the spotlight: Meet Stata's new xlmlogit command. Stata News, 29(2), 2-3.
Sevcikova, H., & Raftery, A. (2012). Estimation of multinomial logit model using the Begg & Gray approximation.
Steele, F. (2013). Module 10: Single-level and multilevel models for nominal responses concepts. Bristol, U.K,: Centre for Multilevel Modelling.
An older question, but a viable option has recently emerged: brms, which uses the Bayesian Stan engine to actually run the model. For example, if you want to run a multinomial logistic regression on the iris data:
b1 <- brm(Species ~ Petal.Length + Petal.Width + Sepal.Length + Sepal.Width,
          data = iris, family = "categorical",
          prior = c(set_prior("normal(0, 8)")))
And to get an ordinal regression -- not appropriate for iris, of course -- you'd switch the family="categorical" to family="acat" (or cratio or sratio, depending on the type of ordinal regression you want) and make sure that the dependent variable is ordered.
Clarification per Raphael's comment: this brm call compiles your formula and arguments into Stan code. Stan compiles it into C++ and uses your system's C++ compiler, which is required. On a Mac, for example, you may need to install the free Developer Tools to get C++; I'm not sure about Windows. Linux should have C++ installed by default.
Clarification per Qaswed's comment: brms easily handles multilevel models as well, using the usual R formula syntax: (1 | groupvar) adds a random intercept for a group, (1 + foo | groupvar) adds a random intercept and slope, etc.
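For completeness, a hedged sketch of what this could look like for the migration example in the question (assumptions: migDat as constructed above; chains and iterations are purely illustrative, and 21 observations are of course far too few for a serious fit):
library(brms)
migDat$mig <- factor(migDat$mig)                 ## family = categorical() expects a factor response
m_mig <- brm(mig ~ age + pollution + (1 | stateID),
             data = migDat, family = categorical(),
             chains = 2, iter = 2000)
summary(m_mig)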
I'm puzzled that this technique is described as "standard" and "equivalent", though it might well be a good practical solution. (Guess I'd better check out the Allison and Dobson & Barnett references.)
For the simple multinomial case (no clusters, repeated measures, etc.), Begg and Gray (1984) propose using k-1 binomial logits against a reference category as an approximation (though a good one in many cases) to the full-blown multinomial logit. They demonstrate some loss of efficiency when using a single reference category, though it's small when a single high-frequency baseline category is used as the reference.
Agresti (2002: p. 274) provides an example where there is a small increase in standard errors even when the baseline category constitutes over 70% of 219 cases in a five category example.
Maybe it's no big deal, but I don't see how the approximation would get any better by adding a second layer of randomness.
References
Agresti, A. (2002). Categorical data analysis. Hoboken NJ: Wiley.
Begg, C. B., & Gray, R. (1984). Calculation of polychotomous logistic regression parameters using individualized regressions. Biometrika, 71(1), 11–18.
I recommend the package "mlogit".
I am dealing with the same issue, and one possible solution I found is to resort to the Poisson (log-linear/count) equivalent of the multinomial logistic model, as described in this mailing list, these nice slides, or in Agresti (2013: 353-356). Thus, it should be possible to use the glmer(..., family=poisson) function from the lme4 package with some aggregation of the data.
Reference:
Agresti, A. (2013) Categorical data analysis. Hoboken, NJ: Wiley.
Since I had the same problem, I recently came across this question. I found the package ordinal, which has a cumulative link mixed model function (clmm2) that seems similar to the proposed brm approach, but frequentist.
Basically, you set the link function (for instance, logit), you can declare nominal variables (that is, variables that do not fulfil the proportional odds assumption), set the threshold to "flexible" if you want to allow unstructured cut-points, and add the "random" argument to specify any variable that should get a random effect. A minimal sketch follows below.
I also found the book Multilevel Modeling Using R (W. Holmes Finch, Jocelyn E. Bolin, and Ken Kelley), which illustrates how to use the function from page 151 onward, with nice examples.
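A hedged sketch using the wine data that ships with the ordinal package (all settings illustrative; swap in your own outcome, predictors, and grouping variable):
library(ordinal)
data(wine)
cm <- clmm2(rating ~ temp + contact, random = judge,  ## random effect for judge
            data = wine, link = "logistic",
            threshold = "flexible", Hess = TRUE)      ## Hess = TRUE so summary() gives std. errors
summary(cm)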
Here's an implementation (not my own). I'd just work off this code. Plus, this way you'll really know what's going on under the hood.
http://www.nhsilbert.net/docs/rcode/multilevel_multinomial_logistic_regression.R

Variable sample size per cluster/group in mixed effects logistic regression

I am attempting to run mixed effects logistic regression models, yet am concerned about the variable sample sizes in each cluster/group, and also the very low number of "successes" in some models.
I have ~ 700 trees distributed across 163 field plots (i.e., the cluster/group), visited annually from 2004-11. I am fitting separate mixed effects logistic regression models (hereafter GLMMs) for each year of the study to compare this output to inference from a shared frailty model (i.e., survival analysis with random effect).
The number of trees per plot varies from 1-22. Also, some years have a very low number of "successes" (i.e., diseased trees). For example, in 2011 there were only 4 successes out of 694 "failures" (i.e., healthy trees).
My questions are: (1) is there a general rule for the ideal number of samples per group when the inference focus is only on estimating the fixed effects in the GLMM, and (2) are GLMMs stable when the ratio of successes to failures is this extreme?
Thank you for any advice or suggestions of sources.
-Sarah
(Hi, Sarah, sorry I didn't answer previously via e-mail ...)
It's hard to answer these questions in general -- you're stuck with your data, right? So it's not a question of power analysis. If you want to make sure that your results will be reasonably reliable, probably the best thing to do is to run some simulations. I'm going to show off a fairly recent feature of lme4 (in the development version 1.1-1, on Github), which is to simulate data from a GLMM given a formula and a set of parameters.
First I have to simulate the predictor variables (you wouldn't have to do this, since you already have the data -- although you might want to try varying the range of number of plots, trees per plot, etc.).
set.seed(101)
## simulate number of trees per plot
## want mean of 700/163=4.3 trees, range=1-22
## by trial and error this is about right
r1 <- rnbinom(163, mu = 3.3, size = 2) + 1
## generate plots and trees within plots
d <- data.frame(plot = factor(rep(1:163, r1)),
                tree = factor(unlist(lapply(r1, seq))))
## expand by year
library(plyr)
d2 <- ddply(d, c("plot", "tree"),
            transform, year = factor(2004:2011))
Now set up the parameters: I'm going to assume year is a fixed effect and that overall disease incidence is plogis(-2)=0.12, except in 2011 when it is plogis(-2-3)=0.0067. The among-plot standard deviation is 1 (on the logit scale), as is the among-tree-within-plot standard deviation:
beta <- c(-2,0,0,0,0,0,0,-3)
theta <- c(1,1) ## sd by plot and plot:tree
Now simulate: year as fixed effect, plot and tree-within-plot as random effects:
library(lme4)
s1 <- simulate(~ year + (1 | plot/tree), family = binomial,
               newdata = d2, newparams = list(beta = beta, theta = theta))
d2$diseased <- s1[[1]]
Summarize/check:
d2sum <- ddply(d2, c("year", "plot"),
               summarise,
               n = length(tree),
               nDis = sum(diseased),
               propDis = nDis/n)
library(ggplot2)
library(Hmisc)  ## for mean_cl_boot
theme_set(theme_bw())
ggplot(d2sum, aes(x = year, y = propDis)) +
    geom_point(aes(size = n), alpha = 0.3) +
    stat_summary(fun.data = mean_cl_boot, colour = "red")
Now fit the model:
g1 <- glmer(diseased ~ year + (1 | plot/tree), family = binomial,
            data = d2)
fixef(g1)
You can try this many times and see how often the results are reliable ...
As Josh said, this is a better question for CrossValidated.
There are no hard and fast rules for logistic regression, but one rule of thumb is that you need 10 successes and 10 failures per cell in the design (cluster, in this case), times the number of continuous variables in the model.
In your case, I would think the model, if it converges, would be unstable. You can examine that by bootstrapping the errors of the estimates of the fixed effects, as in the sketch below.
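For example, with lme4 the fixed-effect estimates can be bootstrapped with bootMer (a hedged sketch, reusing the g1 model fitted in the simulation answer above; nsim kept small for speed):
library(lme4)
bb <- bootMer(g1, FUN = fixef, nsim = 100)   ## parametric bootstrap of the fixed effects
apply(bb$t, 2, sd)                           ## bootstrap standard errors of the estimates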
