I've recently started using R for data analysis. Now I've got a problem ranking a big query dataset (~1 GB in ASCII form, which exceeds my laptop's 4 GB of RAM in binary form). Using bigmemory::big.matrix for this dataset seemed like a nice solution, but passing such a matrix 'm' to gbm() or randomForest() causes the error:
cannot coerce class 'structure("big.matrix", package = "bigmemory")' into a data.frame
class(m) outputs the following:
[1] "big.matrix"
attr(,"package")
[1] "bigmemory"
Is there a way to correctly pass a big.matrix instance into these algorithms?
I obviously can't test this using data of your scale, but I can reproduce your errors by using the formula interface of each function:
require(bigmemory)
m <- matrix(sample(0:1,5000,replace = TRUE),1000,5)
colnames(m) <- paste("V",1:5,sep = "")
bm <- as.big.matrix(m,type = "integer")
require(gbm)
require(randomForest)
#Throws error you describe
rs <- randomForest(V1~.,data = bm)
#Runs without error (with a warning about the response only having two values)
rs <- randomForest(x = bm[,-1],y = bm[,1])
#Throws error you describe
rs <- gbm(V1~.,data = bm)
#Runs without error
rs <- gbm.fit(x = bm[,-1],y = bm[,1])
Avoiding the formula interface of randomForest is fairly common advice for large data sets, as the formula interface can be quite inefficient. If you read ?gbm, you'll see a similar recommendation steering you towards gbm.fit for large data as well.
Numeric objects often occupy more memory than the file does on disk: each "double" element of a vector or matrix takes 8 bytes, and coercing an object to a data.frame may force another copy in RAM. You should avoid functions and data structures that aren't supported by the bigmemory/big*** suite of packages. "biglm" is available, but I doubt you can expect gbm() or randomForest() to recognize and use the facilities of the "big" family.
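For completeness, here is roughly what the "big"-family regression route looks like. This is only a minimal sketch: chunk1 and chunk2 are hypothetical data frames holding successive pieces of the data, and V1..V3 are made-up column names, none of them from the question.
library(biglm)
# Hypothetical chunks of the data (not objects from the question).
fit <- biglm(V1 ~ V2 + V3, data = chunk1)   # fit on the first chunk
fit <- update(fit, chunk2)                  # fold in further chunks incrementally
summary(fit)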
On my machine,
m1 = matrix( runif(5*10^7), ncol=10000, nrow=5000 )
uses up about 380 MB. I need to work with many such matrices in memory at the same time (e.g. adding or multiplying them, or applying functions to them). All in all my code uses up 4 GB of RAM because multiple matrices are kept in memory. I am contemplating options to store the data more efficiently (i.e. in a way that uses less RAM).
I have seen the R package bigmemory being recommended. However:
library(bigmemory)
m2 = big.matrix( init = 0, ncol=10000, nrow=5000 )
m2[1:5000,1:10000] <- runif( 5*10^7 )
makes R use about the same amount of memory, as I verified using the Windows Task Manager. So I anticipate no big gain; or am I wrong and should I be using big.matrix differently?
The solution is to work with matrices stored in files, i.e. to pass a non-NULL backingfile argument in the call to big.matrix().
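As a minimal sketch (the file names are just examples), the file-backed call looks like this; the data then live on disk and R keeps only a small pointer object in RAM:
library(bigmemory)
m2 <- big.matrix(nrow = 5000, ncol = 10000, type = "double", init = 0,
                 backingfile = "m2.bin", descriptorfile = "m2.desc",
                 backingpath = tempdir())
# Fill by column so no 5e7-element temporary vector is ever created in RAM.
for (j in seq_len(ncol(m2))) m2[, j] <- runif(nrow(m2))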
Working with filebacked big.matrix from package bigmemory is a good solution.
However, assigning the whole matrix with runif( 5*10^7 ) still creates that large temporary vector in memory. Yet, if you run gc(reset = TRUE), you will see that this memory usage disappears.
If you want to initialize your matrix by blocks (say, blocks of 500 columns), you could use the package bigstatsr. It uses objects similar to a filebacked big.matrix (called FBM) and stores them in your temporary directory by default. You could do:
library(bigstatsr)
m1 <- FBM(1e4, 5e3)
big_apply(m1, a.FUN = function(X, ind) {
X[, ind] <- runif(nrow(X) * length(ind))
NULL
}, a.combine = 'c', block.size = 500)
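If you then need, say, per-column statistics of such an FBM, the same block-wise pattern works for reading as well; a small follow-up sketch using the big_apply() interface shown above:
col_sums <- big_apply(m1, a.FUN = function(X, ind) colSums(X[, ind]),
                      a.combine = 'c', block.size = 500)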
Depending on the makeup of your dataset, a sparse matrix could be your best way forward. This is a common and extremely useful way to improve both space and time efficiency; in fact, some R packages require sparse matrices as input.
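As a minimal sketch with the Matrix package (the number of non-zero cells below is made up), you can build the sparse matrix directly from (row, column, value) triplets so the dense version never has to exist in memory:
library(Matrix)
nz <- 1e5                                    # hypothetical number of non-zero cells
m_sparse <- sparseMatrix(i = sample(5000, nz, replace = TRUE),
                         j = sample(10000, nz, replace = TRUE),
                         x = runif(nz),
                         dims = c(5000, 10000))
format(object.size(m_sparse), units = "MB")  # far below the ~380 MB dense equivalent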
I should say that although I'm learning glmnet for this problem, I've used the same dataset with other methods and it has worked fine.
In this process, I split my data into training and test sets, all formatted as matrices, and glmnet builds the model without complaining. However, when I try to run a prediction on the holdout set, it throws the following error:
glmfit <- glmnet(train_x_mat,train_y_mat, alpha=1)
glmpred <- predict(glmfit, s=glmfit$lambda.1se, new = test_x_mat)
# output:
Error in cbind2(1, newx) %*% nbeta :
Cholmod error 'A and B inner dimensions must match' at file ../MatrixOps/cholmod_ssmult.c, line 82
However, I know that train_x and test_x have the same number of columns:
ncol(test_x)
[1] 146
ncol(train_x)
[1] 146
I'm fairly new to glmnet; is there something more I need to do to make it cooperate?
Edit:
Here are the dimensions. Apologies for posting the vectors originally. This may be more at the heart of it.
dim(train_x_mat)
[1] 1411 208
dim(test_x_mat)
[1] 352 204
Which is strange, because they are created this way:
train_x_mat <- sparse.model.matrix(~.-1, data = train_x, verbose = F)
test_x_mat <- sparse.model.matrix(~.-1, data = test_x, verbose = F)
For anyone else who's running into this problem even though it seems like they shouldn't be: the issue is specifically with R's sparse.model.matrix. It gives every level of a factor its own column, so if your dataset isn't particularly large, a factor level that happens to appear only in the training split (or only in the test split) produces a column in one matrix but not the other, and the two matrices end up with different columns.
A solution, then, is to either add extra, all-zero columns to whichever matrix needs them, or else remove the columns that aren't shared by both; a sketch follows below. Of course, if you're building a model and expecting new data, the former is preferable. But anyway, the whole problem is a sign that your dataset is rather small for the job.
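Here is a minimal sketch of that approach, reusing the object names from the question (train_x_mat, test_x_mat, glmfit); it assumes the mismatch comes only from factor levels missing in one split:
library(Matrix)
# Add all-zero columns for levels that only occur in the training data,
# then put the test columns in the same order the model was fitted with.
missing_cols <- setdiff(colnames(train_x_mat), colnames(test_x_mat))
if (length(missing_cols) > 0) {
  pad <- Matrix(0, nrow = nrow(test_x_mat), ncol = length(missing_cols),
                sparse = TRUE, dimnames = list(NULL, missing_cols))
  test_x_mat <- cbind(test_x_mat, pad)
}
test_x_mat <- test_x_mat[, colnames(train_x_mat)]  # also drops test-only columns
glmpred <- predict(glmfit, s = glmfit$lambda.1se, newx = test_x_mat)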
With R I want to generate a matrix where the epsilons are the columns and the input data are the rows. However, when I try to assign values into the matrix, an error appears:
"Error in results[, j] <- (probabilities > epsilons[j]) :
replacement has length zero"
I have tried many approaches but I am stuck. Please note that this problem happens when Oracle R (ORE) objects are in use. Below is a small piece of code that reproduces the problem:
library(ORE)
ore.connect(user="XXXX", service_name="XXXXX", host="XXXXXXXX", password="XXXXX", port=XXXX, all=TRUE)
ore.sync('MYDATABASE')
ore.attach()
ore.pull(MY_TABLE)
trainingset <- MY_TABLE$MY_COLUMN[1:1000]
crossvalidationset <- MY_TABLE$MY_COLUMN[1001:2000]
# Training
my_column_avg <- mean(trainingset)
my_column_std <- sd(trainingset)
# Validation
probabilities <- dnorm(crossvalidationset,my_column_avg,my_column_std)
epsilons <- c(0.01,0.05,0.1,0.25,0.5,0.75,0.8)
num_rows <- length(probabilities)
num_cols <- length(epsilons)
results <- matrix(TRUE, num_rows, num_cols)
# Anomaly detection results for several epsilons
for(j in 1:num_cols)
{
results[,j] <- (probabilities > epsilons[j])
}
The object MY_TABLE is an Oracle table object, not a data frame, and so is probabilities, since it was derived from MY_TABLE. When a value from that ORE object was assigned into an ordinary R matrix, the error occurred, as shown in the line below:
results[,j] <- (probabilities > epsilons[j])
The reason for the error described above is the use of the Oracle R library (ORE). If ordinary R data structures are used from the start, for instance by pulling the MY_TABLE Oracle object into a local data frame, the problem never happens. It is therefore good practice to get rid of Oracle R objects and work with R data frames whenever possible.
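A minimal sketch of that advice, using the placeholder names from the question (MY_TABLE, MY_COLUMN) and assuming ore.pull() returns an ordinary local data frame:
local_table <- ore.pull(MY_TABLE)            # bring the Oracle table into local R memory
trainingset        <- local_table$MY_COLUMN[1:1000]
crossvalidationset <- local_table$MY_COLUMN[1001:2000]
probabilities <- dnorm(crossvalidationset, mean(trainingset), sd(trainingset))
epsilons <- c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.8)
results <- outer(probabilities, epsilons, `>`)  # same matrix as the loop, one row per case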
I need to multi-thread my R application as it takes 5 minutes to run and only uses 15% of the computer's available CPU.
An example of a process which takes a while to run is calculating the mean of a very large raster stack containing n layers:
mean = cellStats(raster_layers[[n]], stat='mean', na.rm=TRUE)
Using the parallel library, I can create a new cluster and pass a function to it:
cl <- makeCluster(8, type = "SOCK")
parLapply(cl, raster_layers[[1]], mean_function)
stopCluster(cl)
where mean_function is:
mean_function <- function(raster_object)
{
result = cellStats(raster_object, stat='mean', na.rm=TRUE)
return(result)
}
This method works fine except that the worker processes can't see the 'raster' package, which is required to use cellStats, so it fails saying there is no function cellStats. I have tried loading the library within the function but this doesn't help.
The raster package comes with its own cluster function, and that one CAN see cellStats. However, as far as I can tell, it must be passed a single raster object and must return a raster object, which isn't flexible enough for me: I need to be able to pass a list of objects and get back a numeric value, which I could do with normal clustering via the parallel library if only it could see the raster package's functions.
So, does anybody know how I can pass a package to a node with multi-threading in R? Or, how I can return a single value from the raster cluster function perhaps?
The solution came from Ben Barnes, thank you.
The following code works fine:
mean_function <- function(variable)
{
result = cellStats(variable, stat='mean', na.rm=TRUE)
return(result)
}
cl <- makeCluster(procs, type = "SOCK")
clusterEvalQ(cl, library(raster))
result = parLapply(cl, a_list, mean_function)
stopCluster(cl)
Here procs is the number of processors you wish to use, and a_list is a list of rasters on which cellStats can compute the mean. The list does not have to contain exactly procs rasters; parLapply simply distributes the list's elements across the cluster's workers.
I have the following R code:
data <- read.csv('testfile.data', header = T)
mat = as.matrix(data)
Some more statistics of my testfile.data:
> ncol(data)
[1] 75713
> nrow(data)
[1] 44771
Since this is a large dataset, I am using an Amazon EC2 instance with 64 GB of RAM, so hopefully memory isn't an issue. I am able to load the data (the 1st line works), but the as.matrix transformation (2nd line) throws the following error:
resulting vector exceeds vector length limit in 'AnswerType'
Any clue what might be the issue?
As noted, the development version of R supports vectors larger than 2^31-1. This is more-or-less transparent, for instance
> m = matrix(0L, .Machine$integer.max / 4, 5)
> length(m)
[1] 2684354555
This is with
> R.version.string
[1] "R Under development (unstable) (2012-08-07 r60193)"
Large objects consume a lot of memory (62.5% of my 16 GB, in this example), and doing anything useful with them requires several times that much. Further, even simple operations on large data can take appreciable time, and many operations on long vectors are not yet supported:
> sum(m)
Error: long vectors not supported yet:
/home/mtmorgan/src/R-devel/src/include/Rinlinedfuns.h:100
So it often makes sense to process the data in smaller chunks, iterating through the larger file. This gives full access to R's routines and allows parallel evaluation (via the parallel package). Another strategy is to down-sample the data, which should not be too intimidating to a statistical audience.
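A minimal sketch of the chunked approach, assuming testfile.data is the all-numeric CSV with a header from the question (the chunk size is arbitrary):
con <- file("testfile.data", open = "r")
invisible(readLines(con, n = 1))              # skip the header line
chunk_size <- 10000
col_totals <- NULL
n_rows <- 0
repeat {
  chunk <- tryCatch(read.csv(con, header = FALSE, nrows = chunk_size),
                    error = function(e) NULL) # read.csv errors once the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  m <- as.matrix(chunk)                       # each chunk is small enough for as.matrix
  col_totals <- if (is.null(col_totals)) colSums(m) else col_totals + colSums(m)
  n_rows <- n_rows + nrow(m)
}
close(con)
col_means <- col_totals / n_rows              # e.g. column means of the full 44771 x 75713 file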
Your matrix has more elements than the maximum vector length of 2^31-1. This is a problem because a matrix is just a vector with a dim attribute. read.csv works because it returns a data.frame, which is a list of vectors.
R> 75713*44771 > 2^31-1
[1] TRUE
See ?"Memory-limits" for more details.