What can I do about svmpath needing so much memory in one gulp?

I am trying out the svmpath package, which is supposed to find the optimal hyperparameters for a trained SVM without requiring multiple runs over different subsets of the data. More importantly, according to its docs, it is supposed to have lower computational complexity.
However, it seems to ask for a lot of memory all at once.
Minimal working example:
library(data.table)
library(svmpath)
# Loaded svmpath 0.953
features <- data.table(matrix(runif(100000*16),ncol=16))
labels <- (runif(100000) > 0.7)
svmpath(x=features,y=labels)
# Error in x %*% t(y) : requires numeric/complex matrix/vector arguments
svmpath(x=as.matrix(features),y=labels)
# Error: cannot allocate vector of size 74.5 Gb
library(kernlab)
ksvm(as.matrix(features), y=labels, kernel="vanilladot")
# runs
Inspecting the training function, only one line pops out as a likely memory hog: Kscript <- K * outer(y, y). Indeed this seems to be the culprit: runif(100000) %o% runif(100000) produces the same error (a 100000-by-100000 double matrix needs 100000^2 * 8 bytes, or about 74.5 GiB).
Are there any quick fixes that are easy to implement in R?

Apparently, svmpath doesn't find the optimal C (cost) value by itself. Instead, it lists all the C values that you should try in order to find the best one, using N-fold cross-validation or a test dataset.
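For illustration, a minimal sketch of that second step with kernlab (the candidate C values and the data below are placeholders; in practice you would take the candidates from the fitted svmpath object):
library(kernlab)
# Toy data standing in for the real problem
x <- matrix(runif(1000 * 16), ncol = 16)
y <- factor(runif(1000) > 0.7)
costs <- c(0.01, 0.1, 1, 10, 100)        # hypothetical candidate C values
cv_err <- sapply(costs, function(C) {
  fit <- ksvm(x, y, kernel = "vanilladot", C = C, cross = 5)
  cross(fit)                             # 5-fold cross-validation error
})
costs[which.min(cv_err)]                 # C with the lowest CV error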

Related

MLE using nlminb in R - understand/debug certain errors

This is my first question here, so I will try to make it as well written as possible. Please bear with me should I make a silly mistake.
Briefly, I am trying to do a maximum likelihood estimation where I need to estimate 5 parameters. The general form of the problem I want to solve is as follows: a weighted average of three copulas, each with one parameter to be estimated, where the weights are nonnegative, sum to 1, and also need to be estimated.
There are packages in R for doing MLE on single copulas or on a weighted average of copulas with fixed weights. However, to the best of my knowledge, no packages exist to directly solve the problem I outlined above. Therefore I am trying to code the problem myself. There is one particular type of error I am having trouble tracing to its source. Below I have tried to give a minimal reproducible example where only one parameter needs to be estimated.
library(copula)
set.seed(150)
x <- rCopula(100, claytonCopula(250))
# Copula density
clayton_density <- function(x, theta) {
  dCopula(x, claytonCopula(theta))
}
# Negative log-likelihood function
nll.clayton <- function(theta) {
  theta_trans <- -1 + exp(theta)  # admissible theta values for Clayton copula
  nll <- -sum(log(clayton_density(x, theta_trans)))
  return(nll)
}
# Initial guess for optimization
guess <- function(x) {
  init <- rep(NA, 1)
  tau.n <- cor(x[, 1], x[, 2], method = "kendall")
  # Guess using method of moments
  itau <- iTau(claytonCopula(), tau = tau.n)
  # In case itau is negative, we need a conditional statement
  # Use log because it is (almost) the inverse of the theta transformation above
  if (itau <= 0) {
    init[1] <- log(0.1)  # Ensures positive initial guess
  } else {
    init[1] <- log(itau)
  }
  return(init)  # return the initial guess explicitly
}
estimate <- nlminb(guess(x), nll.clayton)
(parameter <- -1 + exp(estimate$par)) # Retrieve estimated parameter
fitCopula(claytonCopula(), x) # Compare with fitCopula function
This works great when simulating data with small values of the copula parameter, and gives almost exactly the same answer as fitCopula() every time.
For large values of the copula parameter, such as 250, the following error shows up when I run the line with nlminb():
Error in .local(u, copula, log, ...) : parameter is NA
Called from: .local(u, copula, log, ...)
Error during wrapup: unimplemented type (29) in 'eval'
When I run fitCopula(), the optimization finishes, but this message pops up:
Warning message:
In dlogcdtheta(copula, u) :
dlogcdtheta() returned NaN in column(s) 1 for this explicit copula; falling back to numeric derivative for those columns
Using debug() I have been able to find out that somewhere in the optimization process of nlminb(), the parameter of interest is assigned the value NaN, which then yields this error when dCopula() is called. However, I do not know at which iteration this happens, nor what nlminb() is doing when it happens. I suspect that at some iteration the objective function is evaluated at Inf/-Inf, but I do not know what nlminb() does next. Something similar seems to happen with fitCopula(), but there the optimization is still carried out to the end, only with the above-mentioned warning.
I would really appreciate any help in understanding what is going on, how I might debug it myself, and/or how I can deal with the problem. As might be evident from the question, I do not have a strong background in coding. Thank you so much in advance to anyone who takes the time to consider this problem.
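(For reference, a minimal sketch of one way to trap such failing evaluations; this wrapper is illustrative and not part of the original code. It returns a large finite penalty instead of NA/Inf so nlminb() can keep iterating, and records the offending theta values for inspection.)
bad_thetas <- c()
nll.clayton.safe <- function(theta) {
  val <- tryCatch(nll.clayton(theta), error = function(e) NaN)
  if (!is.finite(val)) {
    bad_thetas <<- c(bad_thetas, theta)  # keep the theta values that failed
    val <- 1e10                          # large finite penalty
  }
  val
}
estimate <- nlminb(guess(x), nll.clayton.safe)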
Update:
When I run dCopula(x, claytonCopula(-1+exp(guess(x)))) or, equivalently, clayton_density(x, -1+exp(guess(x))), it becomes apparent that the density evaluates to 0 at several data points. Unfortunately, creating pseudo-observations with x <- pobs(x) does not solve the problem, as can be seen by repeating dCopula(x, claytonCopula(-1+exp(guess(x)))). The result is that when applying the logarithm we get several -Inf evaluations, which of course implies that the whole negative log-likelihood function evaluates to Inf, as can be seen by running nll.clayton(guess(x)). Hence, in addition to the above queries, any tips on handling log(0) when doing MLE numerically are welcome and appreciated.
Second update
Editing the second line in nll.clayton as follows seems to work okay:
nll <- -sum(log(clayton_density(x, theta_trans) + 1e-8))
However, I do not know whether this is a "good" way to circumvent the problem, in the sense of not introducing potential for large errors (though it would surprise me if it did).
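An alternative sketch (not from the original post): clamp only the zero densities to the smallest representable positive double, leaving non-zero densities untouched.
nll.clayton2 <- function(theta) {
  theta_trans <- -1 + exp(theta)
  dens <- pmax(clayton_density(x, theta_trans), .Machine$double.xmin)  # avoid log(0)
  -sum(log(dens))
}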

Elastic net with Cox regression

I am trying to perform elastic net with Cox regression on 120 samples with ~100k features.
I tried R with the glmnet package, but R did not handle such big matrices (it seems R is not designed for 64-bit). Furthermore, glmnet does support sparse matrices, but for whatever reason sparse matrices + Cox regression have not been implemented.
I am not pushing for R, but it is the only tool I have found so far. Does anyone know what program I can use to fit elastic net + Cox regression on big models? I did read that I could use a Support Vector Machine, but I would need to calculate the model first and I cannot do that in R due to the above restriction.
Edit:
A bit of clarification: I am not reporting an error in R, as apparently it is normal for R to be limited in how many elements its matrices can hold (as for glmnet not supporting sparse matrix + Cox, I have no idea). I am not pushing for a particular tool, but it would be easier if there were another package or a standalone program that can do what I am looking for.
If someone has an idea or has done this before, please share your method (R, Matlab, something else).
Edit 2:
Here is what I used to test:
I made a matrix of 100x100000, added labels, and tried to create the model matrix using model.matrix.
data <- matrix(rnorm(100*100000), 100, 100000)
formula <- as.formula(class ~ .)
x = c(rep('A', 40), rep('B', 30), rep('C', 30))
y = sample(x=1:100, size=100)
class = x[y]
data <- cbind(data, class)
X <- model.matrix(formula, data)
The error I got:
Error: cannot allocate vector of size 37.3 Gb
In addition: Warning messages:
1: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
2: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
3: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
4: In terms.formula(object, data = data) :
Reached total allocation of 12211Mb: see help(memory.size)
Thank you in advance! :)
Edit 3:
Thanks to @marbel I was able to construct a test model that works and does not become too big. It seems my problem came from using cbind in my test.
A few pointers:
a) That's a rather small dataset; R should be more than enough. All you need is a modern computer, meaning a decent amount of RAM. I guess 4 GB should be enough for such a small dataset.
The package is also available in Julia and Python, but I'm not sure whether the Cox model is available there.
Here and here you have examples of the Cox model with the glmnet package. There is also a package called survival.
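As a rough sketch of that approach (the features and survival outcomes below are simulated placeholders, not the poster's data; alpha = 0.5 mixes the lasso and ridge penalties to give an elastic net):
library(glmnet)
n <- 120; p <- 5000                         # p kept small here for illustration
x <- matrix(rnorm(n * p), n, p)             # numeric features: no model.matrix needed
y <- cbind(time   = rexp(n, 0.1),           # hypothetical survival times
           status = rbinom(n, 1, 0.7))      # hypothetical event indicator
fit   <- glmnet(x, y, family = "cox", alpha = 0.5)
cvfit <- cv.glmnet(x, y, family = "cox", alpha = 0.5)  # choose lambda by cross-validation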
There are at least two problems with your code:
This is not something you would want to do in R: data <- cbind(data, class). It's just not memory efficient. If you need to do this type of operation, use the data.table package; it allows assignment by reference, check out the := operator (see the short sketch after this list).
If all your data is numeric you don't need to use model.matrix, just use data.matrix(X).
If you have categorical variables, use model.matrix on them only, then add the resulting columns to the X matrix, perhaps with data.table, one column at a time using ?data.table::set or the := operator.
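A small sketch of the assignment-by-reference idea (the column name class is just illustrative):
library(data.table)
dt <- as.data.table(matrix(rnorm(100 * 5), 100, 5))
dt[, class := sample(c("A", "B", "C"), 100, replace = TRUE)]  # added by reference, no copy of dt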
Hopefully this can help you debug the code. Good luck!

modelTest and negative edges lengths with phangorn in R

I'm analysing protein data sets. I'm trying to build a tree with the package phangorn in R.
When I construct it, I get negative edge lengths that sometimes make it difficult to proceed with the analysis (modelTest).
For larger datasets (more than 250 proteins) I can't perform modelTest at all, apparently because of the negative edge lengths. For smaller datasets, however, modelTest runs even though there are some negative edge lengths.
I am running it directly from my terminal.
library(phangorn)
dat = read.phyDat(file, format="fasta", type="AA")
tax <- read.table("organism_names.txt", sep="\t", row.names=1)
names(dat) <- tax[,1]
distance <- dist.ml(dat, model="WAG")
tree <- bionj(distance)
mt <- modelTest(dat, tree, model=c("WAG", "LG", "cpREV", "mtArt", "MtZoa", "mtREV24"),multicore=TRUE)
Error: NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In pml(tree, data) : negative edges length changed to 0!
Does somebody have any idea of what I can do?
cheers, Alba
As @Marc said, your example isn't really reproducible...
If the problem really is negative or zero branch lengths, you could try making them a really small positive number, for instance:
tree$edge.length[which(tree$edge.length <= 0)] <- 0.0000001
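Applied to the example above, that would look something like this (a sketch; the model list is shortened here for brevity):
# Clamp non-positive branch lengths, then retry modelTest()
tree$edge.length[tree$edge.length <= 0] <- 1e-7
mt <- modelTest(dat, tree, model = c("WAG", "LG"), multicore = TRUE)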
Another tip is to subscribe to R-sig-phylo, a mailing list about phylogenetics in R. People there are really knowledgeable and usually respond pretty fast.

full precision may not have been achieved in 'qbeta'

I am running R version 2.14.0 on a PC with Windows 7 Ultimate (Intel Core i5-2400 3 GHz processor, 8.00 GB RAM). Let me know if other specs are needed.
I am trying to simulate correlated beta distributed data. The method I am using is an extension of what is written in this paper:
http://onlinelibrary.wiley.com/doi/10.1002/asmb.901/pdf
Basically, I start by simulating multivariate normal data (using the mvrnorm() function from MASS).
Then I use pnorm() to apply the probit transform to these data, so that my new data vector(s) live on (0,1) and are still correlated as described above.
Then, given these probit-transformed data, I apply the qbeta() function with certain shape1 and shape2 parameters to get back correlated beta data with certain mean and dispersion properties.
I know other methods for generating correlated beta data exist. I am interested in why qbeta() causes this method to fail for certain "seeds". Below is the warning message I get.
Warning message:
In qbeta(probit_y0, shape1 = a0, shape2 = b0) :
full precision may not have been achieved in 'qbeta'
What does this mean? How can it be avoided? And when it does occur within the context of a larger simulation, what is the best way to ensure that this problem does not terminate the entire sourced (using source()) simulation code?
I ran the following code for integer seeds from 1:1000. Seed = 899 was the only value that gave me problems, though if it's problematic here, it will inevitably be problematic for other seeds too.
library(MASS)
set.seed(899)
n0 <- 25
n1 <- 25
a0 <- 0.25
b0 <- 4.75
a1 <- 0.25
b1 <- 4.75
varcov_mat <- matrix(rep(0.25,n0*n0),ncol=n0)
diag(varcov_mat) <- 1
y0 <- mvrnorm(1,mu=rep(0,n0),Sigma=varcov_mat)
y1 <- mvrnorm(1,mu=rep(0,n1),Sigma=varcov_mat)
probit_y0 <- pnorm(y0)
probit_y1 <- pnorm(y1)
beta_y0 <- qbeta(probit_y0, shape1=a0, shape2=b0)
beta_y1 <- qbeta(probit_y1, shape1=a1, shape2=b1)
The above code is a fragment of a larger simulation project, but the qbeta() warning message is what is giving me a headache right now.
Any help the group could provide would be greatly appreciated.
Cheers
Chris
The reason for the warning is that the algorithm used to calculate qbeta did not converge for those values of the parameters.
R uses AS 109 to calculate qbeta (Cran, G. W., K. J. Martin and G. E. Thomas (1977), Remark AS R19 and Algorithm AS 109, Applied Statistics, 26, 111–114, and subsequent remarks (AS 83 and correction)). R tries to calculate the value in 1000 iterations; if it cannot within those 1000 iterations, you get the message you saw.
Here is the code for qbeta.
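Regarding keeping a long sourced simulation running: a minimal sketch (not part of the original answer) is to trap and muffle the warning with withCallingHandlers(), recording that it happened so the affected draw can be inspected or re-simulated later.
# Sketch: catch the qbeta() precision warning instead of letting it propagate
precision_flag <- FALSE
beta_y0 <- withCallingHandlers(
  qbeta(probit_y0, shape1 = a0, shape2 = b0),
  warning = function(w) {
    precision_flag <<- TRUE        # remember that full precision was not achieved
    invokeRestart("muffleWarning")
  }
)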

calculate mean for each cell in grid in R [duplicate]

Does somebody know whether there is a sliding-window method in R for 2D matrices, not just vectors? I need to apply the median function to an image stored in a matrix.
The focal() function in the excellent raster package is good for this. It takes several arguments beyond those shown in the example below, and can be used to specify a non-rectangular sliding window if that's needed.
library(raster)
## Create some example data
m <- matrix(1, ncol=10, nrow=10)
diag(m) <- 2
r <- as(m, "RasterLayer") # Coerce matrix to RasterLayer object
## Apply a function that returns a single value when passed values of cells
## in a 3-by-3 window surrounding each focal cell
rmean <- focal(r, w=matrix(1/9, ncol=3, nrow=3), fun=mean)
rmedian <- focal(r, w=matrix(1/9, ncol=3, nrow=3), fun=median)
## Plot the results to confirm that this behaves as you'd expect
par(mfcol=c(1,3))
plot(r)
plot(rmean)
plot(rmedian)
## Coerce results back to a matrix, if you so desire
mmean <- as(rmean, "matrix")
I know this is an old question, but I have come across it many times when trying to solve a similar problem. While the focal() function in the raster package IS very straightforward and convenient, I have found it to be very slow when working with large rasters. There are many ways to address this, but one way I found is to call "WhiteboxTools", a command-line-driven set of raster analysis tools, via system commands. Its main advantage is that it executes the tools in parallel and really takes advantage of multi-core CPUs. I know R has many cluster functions and packages (which I use for random forest raster prediction), but I have struggled with much of the parallel computing implementation in R. WhiteboxTools has discrete functions for mean, max, majority, median, etc. filters (not to mention loads of terrain-processing tools, which is great for DEM-centric analyses).
Some example code for how I implemented a modal (majority) filter (3x3 window) in R on a large classified land-cover raster (nrow=3793, ncol=6789, ncell=25750677) using WhiteboxTools:
system('C:/WBT2/target/release/whitebox_tools --wd="D:/Temp" ^
--run=MajorityFilter -v --input="input_rast.tif" ^
--output="maj_filt_rast.tif" --filterx=3 --filtery=3',
wait = T, timeout=0, show.output.on.console = T)
The above code took less than 3.5 seconds to execute, whereas the equivalent focal() call from the raster package using fun=modal took 5 minutes to complete, coded as:
maj_filt_rast<- focal(input_rast, fun=modal, w=matrix(1,nrow=3,ncol=3))
Getting WhiteboxTools compiled and installed IS a bit annoying, but good instructions are provided. In my opinion it is well worth the effort, as it makes raster processes that were previously prohibitively slow in R run amazingly fast, and it lets me keep all the coding inside R via system commands.
