Error in R: "long vectors not supported yet"

I am using a CentOS 7 Linux compute cluster with 130 GB of RAM, and I am trying to use the svm() function from the e1071 R package. My matrix has 350 rows and 54,250 columns.
R script (file_testR.R):
matris <- matrix(rnorm(100), 350, 54251)
matris <- as.data.frame(matris)
matris$new_variable <- 0
matris$new_variable[1:175] <- "yes"
matris$new_variable[176:350] <- "no"
require(e1071)
svmfit_test <- svm(as.factor(matris$new_variable) ~ ., data = matris, kernel = "linear", cross = 10)
Bash command:
Rscript --max-ppsize=500000 file_testR.R
I get the following error:
Error in model.matrix.default(Terms, m) :
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:522
Calls: svm ... svm.formula -> model.matrix -> model.matrix.default
I would appreciate it if anybody could help me understand this issue.
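One workaround that is often suggested for this error is to bypass the formula interface, since the failure occurs inside model.matrix(): with ~54,000 predictors the bookkeeping objects that the formula machinery builds likely exceed R's 2^31 - 1 element limit. svm() also accepts a numeric predictor matrix x and a response factor y directly, which avoids model.matrix() altogether. A minimal sketch, reusing the objects from the script above and untested at this scale:

library(e1071)

x <- as.matrix(matris[, 1:54251])   # predictors only, as a plain numeric matrix
y <- factor(matris$new_variable)    # response as a factor

# The x/y interface never calls model.matrix(), so no long vector is created.
svmfit_test <- svm(x = x, y = y, kernel = "linear", cross = 10)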

Related

Is it possible to work with vector larger than system RAM in R?

I'm using the R package kuenm to produce and project species distribution models.
I've produced the models without a problem, but when I try to evaluate the extrapolation risk for future projections with the function kuenm_mop I get the error:
Error: cannot allocate vector of size 92GB
The system I'm using runs Windows 8.1 Pro with 64 GB of RAM (which I believe is the limiting factor here).
My question is: is it possible to work with a vector of greater size than my RAM?
This is the code I'm using:
library(kuenm)
sets_var <- "Set_1" #set of variables used
out_mop <- "MOP_results" #output directory
percent <- 10
paral <- FALSE
is_swd <- FALSE
M_var_dir <- "M_variables"
G_var_dir <- "G_variables"
kuenm_mmop(G.var.dir = G_var_dir, M.var.dir = M_var_dir, sets.var = sets_var,
           is.swd = is_swd, out.mop = out_mop, percent = percent, parallel = paral)
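kuenm_mmop() allocates its working objects internally, so it cannot simply be handed an on-disk structure, but to answer the general question: R allocates atomic vectors fully in memory, so a single 92 GB vector cannot be created with 64 GB of RAM. The usual approach for larger-than-RAM data is a file-backed structure such as those in the bigmemory or ff packages (neither is used by kuenm; this only illustrates the idea). A minimal sketch with bigmemory:

library(bigmemory)   # assumed installed; not part of kuenm

# A file-backed big.matrix lives on disk and is paged in on access, so its size
# is limited by disk space rather than RAM. The dimensions here are small for
# illustration; the same call works for matrices far larger than memory.
big <- filebacked.big.matrix(nrow = 1e6, ncol = 100, type = "double",
                             backingfile = "big.bin",
                             descriptorfile = "big.desc")
big[1, 1:5] <- rnorm(5)   # reads and writes go through the file backing
big[1, 1:5]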

How to convert a tensor to an R array (in a loss function, so without eager execution)?

I have TensorFlow version 2.4 and work with the R packages tensorflow (2.2.0) and keras (2.3.0.0.9000).
I would like to convert tensors to R arrays in a loss function (don't ask why).
Here is an example when such a conversion (outside a loss function) works:
library(tensorflow)
library(keras)
x.R <- matrix(1:12, ncol = 3) # dummy R object
x.tensor <- keras_array(x.R) # converting the R object to a tensor
as.array(x.tensor) # converting it back to an R array. This works because...
stopifnot(tf$executing_eagerly()) # ... eager execution is enabled
During training of a model, however, eager execution is disabled, and thus the as.array() call fails. To see this, let's first define a dummy neural network model and training data.
d <- 2 # input and output dimension
in.lay <- layer_input(shape = d)
hid.lay <- layer_dense(in.lay, units = 300, activation = "relu")
out.lay <- layer_dense(hid.lay, units = d, activation = "sigmoid")
model <- keras_model(in.lay, out.lay)
n <- 1200 # number of training samples
data <- matrix(runif(n * d), ncol = d) # training data
Now let's define the loss function and compile the model with it.
myloss <- function(x, y) { # x and y are tensors here
stopifnot(!tf$executing_eagerly()) # confirms that eager execution is disabled
x. <- as.array(x) # ... fails with "RuntimeError: Evaluation error: invalid first argument, must be vector (list or atomic)." How can we convert x to an R array?
loss_mean_squared_error(x, y) # just a dummy return value (the MSE)
}
compile(model, optimizer = "adam", loss = myloss)
Let's try to fit this model (to see that it fails to convert the tensor x to an R array via as.array()).
prior <- matrix(rexp(n * d), ncol = d) # input sample to train the NN on
n.epoch <- 5 # number of epochs to train
batch.size <- 400 # batch size
fit(model, x = prior, y = data, batch_size = batch.size, epochs = n.epoch) # fails with error message given above
The R package tensorflow provides tfe_enable_eager_execution() to enable eager execution
in a session. But if I call it with TensorFlow 2.4, then I obtain:
tfe_enable_eager_execution() # "Error in py_get_attr_impl(x, name, silent) : AttributeError: module 'tensorflow' has no attribute 'contrib'"
Ideally, I wouldn't want to mess with eager execution much (I'm not sure about the side effects); I just want to convert a tensor to an array. My guess is that there is no way around eager execution, since only then are the pointers resolved and the R package tensorflow can find the data in the tensor and convert it to an array.
Other ideas for enabling/disabling eager execution are mentioned here, but they are all in Python and don't seem to be available in R. And this post seems to ask the same question, but in a different context.
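One option that is sometimes suggested, at the cost of training speed, is to ask Keras to run the training step eagerly so that the tensors reaching the loss function are concrete values and as.array() works. run_eagerly is a Model.compile() argument in TensorFlow 2.x; whether the R compile() wrapper forwards it depends on the installed keras version, so treat this as a sketch to verify rather than a confirmed fix:

# Sketch only: run_eagerly = TRUE makes Keras execute the train step eagerly,
# so as.array(x) inside myloss() can resolve the tensor. The
# stopifnot(!tf$executing_eagerly()) line in myloss() would then have to go.
# Verify that your keras version passes this argument through to Python.
compile(model, optimizer = "adam", loss = myloss, run_eagerly = TRUE)
fit(model, x = prior, y = data, batch_size = batch.size, epochs = n.epoch)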

R mlogit package: use LAPACK instead of LINPACK

I am estimating a fairly simple McFadden choice model using a very large data set (101.6 million unit-alternatives). I can estimate this model just fine in Stata using the asclogit command, but when I try the mlogit package in R with the call below, I get an error:
region1 <- mlogit(chosen ~ mean_log.wage + mean_log.rent + bornNear + Dim.1 + regionFE | 0,
                  shape = "long", chid.var = "chid", alt.var = "alternatives", data = ready)
Error in qr.default(na.omit(X)) : too large a matrix for LINPACK
Calls: mlogit ... model.matrix -> model.matrix.mFormula -> qr -> qr.default
If I look at the source code of qr.R, it's clear that the number of elements in my design matrix exceeds the LINPACK limit of 2,147,483,647. However, no such limit exists for LAPACK (as far as I can tell).
From qr.R:
qr.default <- function(x, tol = 1e-07, LAPACK = FALSE, ...)
{
    x <- as.matrix(x)
    if(is.complex(x))
        return(structure(.Internal(La_qr_cmplx(x)), class = "qr"))
    ## otherwise :
    if(LAPACK)
        return(structure(.Internal(La_qr(x)), useLAPACK = TRUE, class = "qr"))
    ## else "Linpack" case:
    p <- as.integer(ncol(x))
    if(is.na(p)) stop("invalid ncol(x)")
    n <- as.integer(nrow(x))
    if(is.na(n)) stop("invalid nrow(x)")
    if(1.0 * n * p > 2147483647) stop("too large a matrix for LINPACK")
    ...
qr() appears to be called from the mFormula method of mlogit while model.matrix is being created, probably during NA checking. But I can't tell whether there is a way to pass LAPACK = TRUE through mlogit, or whether there is a way to skip the NA checking.
I'm hoping @YvesCroissant will see this.
As I mentioned, I can estimate this model just fine in Stata, so it's not a question of resources. My Stata license is not portable, however, which is why I would like to use R.
Thanks to Julius' comment and this post on namespaces in R, I figured out the answer. I added the following code right after my library statements:
source("mymFormula.R")
tmpfun <- get("model.matrix.mFormula", envir = asNamespace("mlogit"))
environment(mymFormula) <- environment(tmpfun)
attributes(mymFormula) <- attributes(tmpfun) # don't know if this is really needed
assignInNamespace("model.matrix.mFormula", mymFormula, ns="mlogit")
mymFormula.R is an R script into which I copy/pasted the contents of mlogit:::model.matrix.mFormula and added mymFormula <- before the function definition at the top of the file.
I viewed the contents of mlogit:::model.matrix.mFormula by typing trace(mlogit:::model.matrix.mFormula, edit=TRUE) in RStudio. (Thanks to this answer for help on how to do that.)
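The answer above doesn't show what was actually changed inside mymFormula.R. Presumably the edit targets the qr() call that trips the LINPACK size check, switching it to the LAPACK path, which (as the qr.default source quoted earlier shows) performs no n * p > 2147483647 test. A small, self-contained illustration of that kind of change; X here is a tiny stand-in, not mlogit's real design matrix:

X <- matrix(rnorm(20), nrow = 5)        # stand-in for the design matrix

qr(na.omit(X))$rank                     # default path: LINPACK, subject to the size check
qr(na.omit(X), LAPACK = TRUE)$rank      # LAPACK path: no size check, but note that it
                                        # always reports full column rank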

'Error: could not find function runmean' from an installed package: caTools?

I installed the 'caTools' R package from the command line:
$ R
> install.packages("caTools", lib="~/R/library")
Then I ran:
INPUT=/home/user/file.bam
OUTPUT=/home/user/file_cor.bam
Rscript run_spp_nodups.R -c=$INPUT -savp -out=$OUTPUT
And got the error:
Error: could not find function "runmean"
Execution halted
The function runmean() belongs to the package I installed, 'caTools'.
The R version should not be the problem: R on my machine is version 3.3.2, and 'caTools' only requires R (≥ 2.2.0).
The R code of 'run_spp_nodups.R' is too big to paste here. I show only the part with runmean:
# Smooth the cross-correlation curve if required
cc <- crosscorr$cross.correlation
crosscorr$min.cc <- crosscorr$cross.correlation[ length(crosscorr$cross.correlation$y) , ] # minimum value and shift of cross-correlation
cat("Minimum cross-correlation value", crosscorr$min.cc$y,"\n",file=stdout())
cat("Minimum cross-correlation shift", crosscorr$min.cc$x,"\n",file=stdout())
sbw <- 2*floor(ceiling(5/iparams$sep.range[2]) / 2) + 1 # smoothing bandwidth
cc$y <- runmean(cc$y,sbw,alg="fast")
What's happening, and how can I solve it?
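A likely cause (an assumption, since the full script isn't shown) is that ~/R/library, the non-default location the package was installed to, is not on Rscript's library search path, so caTools is never attached when the script runs. Two ways to make the package visible, either of which should be enough:

# Option 1 (bash): point Rscript at the custom library before running the script
export R_LIBS_USER=~/R/library
Rscript run_spp_nodups.R -c=$INPUT -savp -out=$OUTPUT

# Option 2 (R): add the path and attach the package near the top of run_spp_nodups.R
.libPaths(c("~/R/library", .libPaths()))
library(caTools)   # provides runmean()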

R Script for KNN algorithm to run on Tableau

I have been writing an R script to integrate with Tableau. When I try to run KNN using R's dbscan package, it throws the error: "K must be greater than 1 and less than the number of samples".
SCRIPT_REAL('
    library(dbscan);
    lof(as.matrix(data.frame(x = .arg1, y = .arg2)), k = 6)
',
SUM([Measure 1]),
SUM([Measure 2])
)
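That error usually means the data actually handed to R has too few rows for k = 6, typically because the measures are aggregated per partition before being passed to SCRIPT_REAL, so each call may receive only a handful of marks. Two things to try (both are assumptions about the setup, not confirmed fixes): change the table calculation's "Compute Using"/addressing so that all marks are sent in a single partition, and clamp k to the number of rows R actually receives, as in this sketch:

SCRIPT_REAL('
    library(dbscan)
    m <- as.matrix(data.frame(x = .arg1, y = .arg2))
    k <- min(6, nrow(m) - 1)                      # never request more neighbours than rows available
    if (k < 2) rep(NA_real_, nrow(m)) else lof(m, k = k)
',
SUM([Measure 1]),
SUM([Measure 2])
)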

Resources