Calibration and lift chart with the caret R package

I am comparing various predictive models on a binary classification task using the caret R package, with respect to both predictive performance (lift chart) and prediction accuracy (calibration plot). I ran into the following issues:
1. The lift function is sometimes very slow when the number of observations is large or when several competing classifiers are compared.
2. I wonder whether it is possible to manually define the cuts of the calibration plot. I have a severely imbalanced model (the average probability is 5%), and the calibration plot function assumes evenly spaced cuts.

The lift plot does the calculation for every unique probability value (much like an ROC curve), which is why it is slow.
Neither of those options is available right now. You can add two issues to the GitHub page. I'm fairly swamped right now, but those shouldn't be a big deal to change (you could always contribute solutions too).
Max
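For reference, here is a minimal sketch of the two caret calls being discussed; the data frame results and its columns obs (the observed class factor) and rf_prob (one model's predicted class probability) are hypothetical names, and by default the first level of obs is treated as the event of interest.
# sketch only: 'results', 'obs' and 'rf_prob' are assumed names
library(caret)
library(lattice)
lift_obj <- lift(obs ~ rf_prob, data = results)
cal_obj  <- calibration(obs ~ rf_prob, data = results)
xyplot(lift_obj)
xyplot(cal_obj)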

Related

Speed up estimation of overlapping additive models (mgcv)

I have some set of variables and I'm fitting many (hundreds of thousands) additive models, each of which includes a subset of all the variables. The dependent variable is the same in every case, and some of the models overlap or are nested. Not all of the independent variables have to enter the model nonparametrically. For clarity, I might have a set of variables {x1,x2,x3,x4,x5} and estimate:
a) y=c+f(x1)+f(x2),
b) y=c+x1+f(x2),
c) y=c+f(x1)+f(x2)+x3, etc.
I'm wondering if there is anything I can do to speed up the gam estimation in this case? Is there anything that is being calculated over and over again that I could calculate once and supply to the function?
What I have already tried:
Memoization since the models repeat exactly from time to time.
Reluctantly switched from thin plate regression splines to cubic regression splines (quite a significant improvement).
The mgcv guide says:
The user can retain most of the advantages of the t.p.r.s. approach by supplying a reduced set of covariate values from which to obtain the basis - typically the number of covariate values used will be substantially smaller than the number of data, and substantially larger than the basis dimension, k.
This caused quite a noticeable improvement with smaller models, e.g. 5 smooths, but not with larger models, e.g. 10 smooths. In fact, in the latter case, it often caused the estimation to take (potentially much) longer.
What I'd like to try but don't know if it's possible:
One obvious thing that repeats itself in both, say, y=c+f(x1)+f(x2) and y=c+x1+f(x2), is the calculation of the basis for f(x2). If I were to use the same knots every time, how (if it's possible at all) could I precalculate the basis for every variable and then supply that to mgcv? Would you expect this to bring a significant time improvement?
Is there anything else you'd recommend?
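For concreteness, here is a sketch of the "same knots every time" idea using gam()'s knots argument, under the assumption that fixed quantile-based knot locations are acceptable; dat, y, x1...x5 and k are hypothetical names, and whether this saves a significant amount of time is exactly the open question.
library(mgcv)
k <- 10
vars <- c("x1", "x2", "x3", "x4", "x5")
# one fixed set of knot locations per covariate, computed once and reused,
# so every cubic regression spline basis is built from the same knots
knot_list <- lapply(dat[vars], function(v) quantile(v, probs = seq(0, 1, length.out = k)))
m_a <- gam(y ~ s(x1, bs = "cr", k = k) + s(x2, bs = "cr", k = k),
           data = dat, knots = knot_list[c("x1", "x2")])
m_b <- gam(y ~ x1 + s(x2, bs = "cr", k = k),
           data = dat, knots = knot_list["x2"])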

Alternatives to LDA for big datasets

I'm analyzing a big dataset of gene expression in R, with 100 samples and 50,000 genes.
I already made some very informative PCA projections of inter-sample patterns. Now I want to make some projections of the data maximizing the differences between the labels I have for the samples.
Normally I would do this with the lda() function from the MASS package. However, this is way too slow and memory intensive.
If the goal is to produce a projection of the samples maximizing the difference between known labels, what are some good alternatives to lda()?
Thanks!
Summary of our discussion in the comments to the question
Linear discriminant analysis does not work on data sets with more features than observations, so you need some form of regularization. If you want to do classification but are mainly interested in the predictive patterns, rather than the predictions themselves, you can use partial least squares discriminant analysis (PLSDA).
However, in your case the components of PLSDA might be hard to interpret, since they will contain one coefficient per gene, and it seems unrealistic to believe that all 50,000 genes are relevant to the phenotype you are studying. An alternative approach I prefer is nearest shrunken centroids or the elastic net, which produce sparse models (i.e. they keep only the most informative genes and discard those of little importance).
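As a concrete illustration of the sparse route, here is a minimal sketch using the elastic net via the glmnet package; the 100 x 50,000 expression matrix expr and the factor of sample labels labels are hypothetical names.
library(glmnet)
set.seed(1)
# alpha between 0 (ridge) and 1 (lasso); 0.5 gives an elastic net
fit <- cv.glmnet(x = expr, y = labels, family = "binomial", alpha = 0.5)
# genes retained by the sparse model at the cross-validated lambda
sel <- which(coef(fit, s = "lambda.min")[-1] != 0)
# project the samples onto the fitted linear score, analogous to a discriminant axis
score <- predict(fit, newx = expr, s = "lambda.min", type = "link")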
You could run your LDA model on a sample of the data set.

Profiling SVM (e1071) in R

I am new to R and SVMs, and I am trying to profile the svm function from the e1071 package. However, I can't find any large dataset that lets me get a good range of profiling results as I vary the size of the input data. Does anyone know how to make svm work hard? Which dataset should I use? Any particular parameters to svm that make it work harder?
I copy some commands that I am using to test the performance; perhaps it is easiest to see what I am trying to do from these:
#load libraries
library(class)
library(e1071)
#I've been using golubEsets (more examples available)
library(golubEsets)
#get the data: 7129 x 38 expression matrix
data(Golub_Train)
n <- exprs(Golub_Train)
#duplicate rows (to make the dataset larger)
n <- rbind(n, n)
#take the training sample labels as a vector
samplelabels <- as.vector(Golub_Train@phenoData@data$ALL.AML)
#fit the svm and profile it
Rprof("svm.out")
svmmodel1 <- svm(x = t(n), y = samplelabels, type = "C-classification", kernel = "radial", cross = 10)
Rprof(NULL)
I keep increasing the dataset by duplicating rows and columns, but I reach the memory limit instead of making svm work harder...
In terms of "working SVM out": what will make an SVM work "harder" is a more complex model that is not easily separable, higher dimensionality, and a larger, denser dataset.
SVM performance degrades as:
Dataset size increases (number of data points)
Sparsity decreases (fewer zeros)
Dimensionality increases (number of attributes)
Non-linear kernels are used (and kernel parameters can make the kernel evaluation more complex)
Varying Parameters
There are parameters you can change to make the SVM take longer. Of course, the parameters affect the quality of the solution you will get and may not make any sense to use.
Using C-SVM, varying C will result in different runtimes (the analogous parameter in nu-SVM is nu). If the dataset is reasonably separable, making C smaller will result in a longer runtime because the SVM will allow more training points to become support vectors. If the dataset is not very separable, making C bigger will cause longer runtimes because you are essentially telling the SVM you want a narrow-margin solution that fits tightly to the data, and that will take much longer to compute when the data don't separate easily.
Often you find when doing a parameter search that there are parameters that will increase computation time with no appreciable increase in accuracy.
The other parameters are kernel parameters and if you vary them to increase the complexity of calculating the kernel then naturally the SVM runtime will increase. The linear kernel is simple and will be the fastest; non-linear kernels will of course take longer. Some parameters may not increase the calculation complexity of the kernel, but will force a much more complex model, which may take SVM much longer to find the optimal solution to.
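As a small sketch of the effect of the cost parameter on runtime, here is a timing loop on simulated data that is deliberately not very separable (all names and values are arbitrary):
library(e1071)
set.seed(1)
n <- 2000
x <- matrix(rnorm(n * 50), ncol = 50)
y <- factor(ifelse(x[, 1] + rnorm(n, sd = 2) > 0, "a", "b"))  # heavy class overlap
for (C in c(0.01, 1, 100)) {
  elapsed <- system.time(
    svm(x, y, type = "C-classification", kernel = "radial", cost = C)
  )["elapsed"]
  cat("cost =", C, "-> elapsed:", elapsed, "s\n")
}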
Datasets to Use:
The UCI Machine Learning Repository is a great source of datasets.
The MNIST handwriting recognition dataset is a good one to use - you can randomly select subsets of the data to create increasingly large datasets. Keep in mind that the data at the link contain all ten digits; SVM is of course binary, so you would have to reduce the data to just two digits or do some kind of multi-class SVM.
You can easily generate datasets as well. To generate a linear dataset, randomly select a normal vector to a hyperplane, then generate a datapoint and determine which side of the hyperplane it falls on to label it. Add some randomness so that points within a certain distance of the hyperplane are sometimes labeled differently, and increase the complexity by increasing that overlap between classes. Or generate some number of clusters of normally distributed points, labeled either 1 or -1, so that the distributions overlap at the edges. The classic non-linear example is a checkerboard: generate points and label them in a checkerboard pattern. To make it more difficult, enlarge the number of squares, increase the dimensions and increase the number of datapoints. You will have to use a non-linear kernel for that, of course.
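Here is a sketch of the hyperplane-based generator described above; dimension, sample size and noise level are arbitrary choices.
set.seed(2)
d <- 10                                  # dimensionality
n <- 5000                                # number of points
w <- rnorm(d); w <- w / sqrt(sum(w^2))   # random unit normal vector of the hyperplane
X <- matrix(runif(n * d, -1, 1), ncol = d)
margin <- as.vector(X %*% w)             # signed distance from the hyperplane
y <- ifelse(margin + rnorm(n, sd = 0.1) > 0, 1, -1)  # noise blurs points near the plane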

two-sided censored model in R (similar to Zelig's Tobit)?

Is there a model for dependent variables that are censored on both sides? And if so, is there an implementation in R? I am only aware of tobit models (e.g. in the Zelig package), but they're obviously only censored on the left side... I wonder if it even makes sense to truncate on both sides...
There's a difference between truncation and censoring, and you need to be aware of which is the case before you start modelling. In a nutshell: censoring means events can be detected, but the measurements are not known completely (i.e. in your case you know neither the exact beginning nor the exact end of the time interval during which subjects were at risk for the event you're considering). Truncation means events can be observed only if another condition is fulfilled: a popular example is survival in a retirement home that only accepts people over 65 to take up residence - entry into the study population is then truncated at age 65.
If you have both left- and right-censored data, or data that are simultaneously right- and left-censored, the technical term you are looking for is interval censored. ?Surv in the survival package will show you how to define interval-censored observations for modelling time-to-event in that case.
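A minimal sketch of defining interval-censored observations with Surv and fitting a parametric model via survreg; the interval bounds here are simulated purely for illustration.
library(survival)
set.seed(3)
t_true <- rweibull(30, shape = 1.5, scale = 10)
# suppose each event time is only known to within +/- 0.5
df <- data.frame(lower = pmax(t_true - 0.5, 0.01), upper = t_true + 0.5)
fit <- survreg(Surv(lower, upper, type = "interval2") ~ 1, data = df, dist = "weibull")
summary(fit)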
In a very real sense, most of the observational studies on "free-range human" populations are doubly censored, i.e. we do not observe the individuals over all of their lifespans. Here is a citation to a PhD thesis that seems to lay out the statistical terminology well. Furthermore, several R packages will function properly when set up for interval censoring or left-censoring, including survival, NADA, sand (from their DOE website) and several others, which you can search for at Baron's website with appropriate search strategies to get both functions and r-help entries.
Edit: Adding comments to address the clarification that this is about truncation rather than censoring.
If one is looking to fit to truncated distributions then look at the gamlss package, or create a suitable density for a doubly-truncated distribution and use fitdistr in the MASS package.
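Here is a sketch of the second route: write down the doubly-truncated density by hand and pass it to fitdistr; a normal distribution truncated to an assumed interval [a, b] is used purely as an example.
library(MASS)
a <- 0; b <- 10   # assumed truncation bounds
dtnorm <- function(x, mean, sd) {
  dnorm(x, mean, sd) / (pnorm(b, mean, sd) - pnorm(a, mean, sd))
}
set.seed(4)
x <- rnorm(5000, mean = 4, sd = 3)
x <- x[x >= a & x <= b]                   # doubly truncated sample
fit <- fitdistr(x, dtnorm, start = list(mean = mean(x), sd = sd(x)),
                lower = c(-Inf, 1e-6))    # keep sd positive during optimisation
fit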

What R packages are available for binary data that is both correlated and clustered?

I'm working on a project now that's rather unlike anything I've done before. I have two tests with binary results that will be administered to the same sample, which is drawn from a clustered population (i.e., some subjects will be from the same family). I'd like to compare proportions of positive test results, but the clustering makes McNemar's test inappropriate so I've been reading up on alternative approaches. The two main routes seem to be 1) the clustering-adjusted McNemar alternatives by Rao and Scott (1992), Eliasziw and Donner (1991), and Obuchowski (1998), and 2) GEE.
Do you know of any implementations of the Rao-Obuchowski lineage in R (or, I suppose, SAS)? GEE is easy to find, but have you had a positive or negative experience with any particular packages? Is there another route to analyzing these data that I'm completely missing?
You could always just use a clustered bootstrap: resample across families, which you believe are independent - that is, keep families together when you resample. Compute p2 - p1 for each sample. After 1000 iterations or so, compute the upper and lower 2.5% quantiles; this gives you a bootstrapped 95% confidence interval. Alternatively, compute the fraction of samples above zero, or whatever your hypothesis is. The procedure should have pretty good properties unless the number of families is small.
It's probably easiest to do this by hand in R rather than relying on any package.
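A sketch of the by-hand version, assuming a data frame dat with a family column identifying clusters and 0/1 columns test1 and test2 (all hypothetical names):
set.seed(5)
fams <- unique(dat$family)
B <- 1000
diffs <- replicate(B, {
  resampled <- sample(fams, length(fams), replace = TRUE)   # resample whole families
  boot_dat  <- do.call(rbind, lapply(resampled, function(f) dat[dat$family == f, ]))
  mean(boot_dat$test2) - mean(boot_dat$test1)               # p2 - p1 in this resample
})
quantile(diffs, c(0.025, 0.975))   # bootstrapped 95% confidence interval
mean(diffs > 0)                    # fraction of resamples above zero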
Check out the survey package: it is designed to take into account correlations induced by clustered sampling.
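A sketch of that approach, again assuming a data frame dat with a family cluster identifier and 0/1 columns test1 and test2 (hypothetical names); svydesign will warn that equal sampling weights are being assumed.
library(survey)
des <- svydesign(ids = ~family, data = dat)       # clusters are families
svymean(~test1 + test2, des)                      # proportions with cluster-robust SEs
confint(svymean(~I(test2 - test1), des))          # CI for the difference in proportions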
Have you already checked the CorrBin package in R? It is for the analysis of correlated binary data; there is a paper by Szabo, "Using the CorrBin package for nonparametric analysis of correlated binary data", which covers the Rao-Scott test, stochastic ordering, and three versions of a GEE-based test.
The clust.bin.pair package for clustered binary matched-pair data was recently published to CRAN.
It contains implementations of Eliasziw and Donner (1991) and Obuchowski (1998), as well as two more recent tests in the same family, Durkalski (2003) and Yang (2010).
