Defining a large matrix with the "bigmemory" package in R

I am using the bigmemory package and need to define a large matrix (20000 * 20000).
A <- big.matrix(20000, 20000, type = "double", init = 0)
Resulting in:
Error: memory could not be allocated for instance of type big.matrix
My questions:
(1.) Does the package support a matrix of that size in general?
(2.) If not, are there any other options to create such a matrix in R?
Many thanks for your help

This answer expands on Imo's explanation of specifying file-backing.
Unfortunately, the current CRAN version of the package (4.5.36) doesn't contain a vignette anymore, but thankfully it's possible to download older versions that contain it. For example, the vignette for version 4.5.28 contains the following piece of code:
x <- read.big.matrix("airline.csv", type = "integer", header = TRUE,
                     backingfile = "airline.bin",
                     descriptorfile = "airline.desc",
                     extraCols = "Age")
If you wish to keep your working directory clean, you can use the tempfile() and tempdir() functions. Here's one example:
temp_file <- gsub("/", "", tempfile(tmpdir = ""))
A <- big.matrix(
  20000, 20000, type = "double", init = 0,
  backingpath    = tempdir(),
  backingfile    = paste0(temp_file, ".bak"),
  descriptorfile = paste0(temp_file, ".desc")
)
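Once the backing and descriptor files exist, the matrix can be re-attached later, or from a second R session, without reading anything back into RAM. A minimal sketch, reusing the temp_file name from the snippet above:
library(bigmemory)

# re-attach the file-backed matrix via its descriptor file;
# `temp_file` is the base name created above
B <- attach.big.matrix(file.path(tempdir(), paste0(temp_file, ".desc")))
dim(B)    # 20000 x 20000
B[1, 1]   # values are read from the memory-mapped backing file on demand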

Related

How to export thousands of constants in an R package?

Goal
I want to expose built-in constants from a package I am developing. They originally come from C source code, where they are defined with #define directives.
OpenGL constants defined as #define directives
I'm wrapping the C library GLFW using Rcpp. This C library, in turn, includes OpenGL declarations with many #defines; here's a short snippet from gl.h:
#define GL_T4F_C4F_N3F_V4F 0x2A2D
#define GL_MATRIX_MODE 0x0BA0
#define GL_MODELVIEW 0x1700
#define GL_PROJECTION 0x1701
#define GL_TEXTURE 0x1702
#define GL_POINT_SMOOTH 0x0B10
#define GL_POINT_SIZE 0x0B11
#define GL_POINT_SIZE_GRANULARITY 0x0B13
#define GL_POINT_SIZE_RANGE 0x0B12
#define GL_LINE_SMOOTH 0x0B20
Wrapping of C macro definitions in C++ functions
Now, I've been exposing these C macro definitions by wrapping them in C++ functions, e.g.:
// [[Rcpp::export]]
Rcpp::IntegerVector macro_gl_matrix_mode() {return Rcpp::wrap((unsigned int) GL_MATRIX_MODE);}
Exporting the variables
And then I have an R source file in data-raw/ that essentially calls those unexported functions and saves each object to disk (abbreviated for clarity):
library(glfw)
library(tibble)
library(usethis)
library(fs)
library(dplyr)
#
# use_data2 accepts a list of strings with the names of the objects to be
# exported instead of the interface provided by usethis::use_data that expects
# multiple arguments passed in `...`.
#
use_data2 <- function(objs,
                      internal = FALSE,
                      overwrite = FALSE,
                      compress = "bzip2",
                      version = 2,
                      envir = parent.frame())
{
  usethis:::check_is_package("use_data()")
  if (internal) {
    usethis::use_directory("R")
    paths <- fs::path("R", "sysdata.rda")
    objs <- list(objs)
  } else {
    usethis::use_directory("data")
    paths <- fs::path("data", objs, ext = "rda")
  }
  usethis:::check_files_absent(proj_path(paths), overwrite = overwrite)
  usethis::ui_done("Saving {ui_value(unlist(objs))} to {ui_value(paths)}")
  mapply(save, list = objs,
         file = proj_path(paths),
         MoreArgs = list(envir = envir, compress = compress, version = version))
  invisible()
}
gl <- new.env()
# Begin of loads of assign calls
# (...)
assign('GL_MATRIX_MODE', glfw:::macro_gl_matrix_mode(), envir = gl)
# (...)
# End
#
# Exporting
#
gl_lst <- as.list(gl)
gl_names <- names(gl_lst)
use_data2(gl_names, internal = FALSE, compress = "xz", overwrite = TRUE, version = 2, envir = gl)
This is working, but I have 5727 of these constants to export. When I load my package, it stalls for more than 5 minutes at this stage:
*** moving datasets to lazyload DB
So there's got to be a better way, right? Not only is this very slow at package load time, but I'm also guessing that having thousands of objects in my data/ folder is going to create trouble from a package standards or requirements point of view...
Let me just say that I was trying to avoid encapsulating all these constants in a list or data frame because I wanted to keep the API similar to the C library in this respect, i.e., right now I think it is quite nice to be able to use the variables GL_MODELVIEW or GL_POINT_SIZE_GRANULARITY directly, without any extra syntax.
Any help/pointers are greatly appreciated.
Note: This other question is similar to mine, but it has no answer yet, and the scope might be slightly different because my constants originally come from C code, so there might be a more specific solution to my problem, for instance, using Rcpp in a way I haven't tried yet: Exporting an unwieldy set of constants in a package.
I had a similar problem. I inherited a project that had a large number of values defined in files, which the app sourced to load the data into the global environment. I am converting much of this to a package and wanted these as internal package data. So I wrote this simple script to create the R/sysdata.rda that is loaded when the package is loaded (with LazyData: true in the DESCRIPTION file).
# Start with a clean environment
rm(list = ls(all.names = TRUE))
# Data to be saved
strings <- c("a", "b")
my_list <- list(first = c(1, 2, 3), second = seq(1, 10))
# Get the names
data_names <- paste0(ls(), collapse = ",")
# Create the call as a string
command <- paste0("usethis::use_data(", data_names, ", internal = TRUE, overwrite = TRUE)")
# Execute it
eval(parse(text = command))
# Clean up
rm(list = ls(all.names = TRUE))
I am studying the following approach:
Step 1
Instead of exporting all those constants separately, I am going to export an environment that encapsulates all the constants: the environment gl (similar to @ralf-stubner's suggestion). This makes loading (and rebuilding) the package much faster.
So, in my data-raw/gl_macros.R (the data generating script) I am adding this last line to export the environment gl:
usethis::use_data(gl, internal = FALSE, compress = "xz", overwrite = TRUE, version = 2)
Step 2
And then, to have the convenience of accessing the OpenGL macros by their original names, I add an .onAttach hook in my R/zzz.R:
.onAttach <- function(libname, pkgname) {
  for (n in ls(glfw::gl, all.names = TRUE)) assign(n, get(n, glfw::gl), .GlobalEnv)
} # .onAttach()
It seems to work, at least in an interactive session. This last step takes a few seconds, but it's a lot quicker than the original approach. I am now thinking, though, that this won't work if my package is used by other packages; I'm not sure.
Alternative to step 2
Perhaps this will work best:
.onLoad <- function(libname, pkgname) {
  attach(glfw::gl)
} # .onLoad()
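For comparison, since gl is an ordinary exported environment, the constants can also be reached without any attach hook at all. A small sketch (the values follow from the gl.h snippet above):
library(glfw)

# direct access through the exported environment, no .onAttach/.onLoad needed
glfw::gl$GL_MATRIX_MODE                  # 0x0BA0 == 2976
get("GL_POINT_SIZE", envir = glfw::gl)   # 0x0B11 == 2833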

How to see the actual memory size of a big.matrix object of bigmemory package?

I am using the bigmemory package to load a heavy dataset, but when I check the size of the object (with the function object.size), it always returns 664 bytes. As far as I understand, the size should be almost the same as that of a classic R matrix, depending on the class (double or integer). So why do I get 664 bytes as an answer? Below is reproducible code. The first chunk is really slow, so feel free to reduce the number of simulated values; (10^6 * 20) will be enough.
# CREATE BIG DATABASE -----------------------------------------------------
data <- as.data.frame(matrix(rnorm(6 * 10^6 * 20), ncol = 20))
write.table(data, file = "big-data.csv", sep = ",", row.names = FALSE)
format(object.size(data), units = "auto")
rm(list = ls())
# BIGMEMORY READ ----------------------------------------------------------
library(bigmemory)
ini <- Sys.time()
data <- read.big.matrix(file = "big-data.csv", header = TRUE, type = "double")
print(Sys.time() - ini)
print(object.size(data), units = "auto")
To determine the size of the bigmemory matrix use:
> GetMatrixSize(data)
[1] 9.6e+08
Explanation
Data stored in big.matrix objects can be of type double (8 bytes, the default), integer (4 bytes), short (2 bytes), or char (1 byte).
The reason for the size disparity is that data stores a pointer to a memory-mapped file. You should be able to find the new file in the temporary directory of your machine. - [Paragraph quoted from R High Performance Programming]
Essentially, bigmemory maintains a binary data file on disk, called a backing file, that holds all of the values in a data set. When values from a big.matrix object are needed by R, a check is performed to see if they are already in RAM (cached). If they are, then the cached values are returned. If they are not cached, then they are retrieved from the backing file. These caching operations reduce the amount of time needed to access and manipulate the data across separate calls, and they are transparent to the statistician.
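As a sanity check, the number reported by GetMatrixSize above matches the raw storage the backing file needs for this example: 6 * 10^6 * 20 double values at 8 bytes each.
> 6 * 10^6 * 20 * 8
[1] 9.6e+08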
See page 8 of the documentation for a description
https://cran.r-project.org/web/packages/bigmemory/bigmemory.pdf
References:
R High Performance Programming by Aloysius Lim and William Tjhi
Data Science in R by Duncan Temple Lang and Deborah Nolan

Read binary raster files in R

I want to read binary integers in R and convert them into raster grids.
The files have the following characteristics:
NCols= 4320
NRows= 2160
pixel-size: 1/12 = 0.0833 degrees
upper-left-lat: 90.0-1/24
upper-left-lon: -180.0+1/24
lower-right-lat: -90.0+1/24
lower-right-lon: 180.0
nodata= -5000
scale-factor= 10000
datatype: 16-bit signed integer
byte-order: big endian
Here is what I do:
file <-"http://nasanex.s3.amazonaws.com/AVHRR/GIMMS/3G/1980s/geo81aug15a.n07-VI3g"
dat <- readBin(file,what="integer", size=4, signed = TRUE, n = NRows * NCols, endian = "big")
r <- raster(nrow=2160, ncol=4320)
r[] <- dat
But this doesn't seem to be right; I'd appreciate any suggestions.
I built greenbrown from source (based on the files staged on GitHub) and found that it took quite a long time to process a single file.
system.time(
r1 <- ReadVI3g("http://nasanex.s3.amazonaws.com/AVHRR/GIMMS/3G/1980s/geo81aug15a.n07-VI3g")
)
# user system elapsed
# 3.252 0.973 143.846
Therefore, I suggest having a look at the gimms package, which has been designed for this particular kind of data and, moreover, is available from CRAN. Note that in contrast to ReadVI3g, it does not offer automated quality control yet, but this feature is scheduled for the next version update. In the meantime, overlay from the raster package can be employed to discard low-quality values.
# install.packages("gimms")
library(gimms)
system.time({
  ## download file, see ?downloadGimms for further options
  f <- updateInventory()
  f <- downloadGimms(f[3], overwrite = TRUE)  # download 3rd file in 'f', viz. geo81aug15a.n07-VI3g
  ## rasterize ndvi and flags
  ndvi <- rasterizeGimms(f)
  flag <- rasterizeGimms(f, flag = TRUE)
  ## perform quality control
  r2 <- overlay(ndvi, flag, fun = function(x, y) {
    x[y[] > 1] <- NA
    return(x)
  })
})
# user system elapsed
# 4.538 3.894 26.781
The two resulting images are obviously identical
> unique(r1 - r2, na.rm = TRUE)
[1] 0
but as you can see, the gimms-based code performs much faster. Moreover, it offers parallel functionality (via doParallel) in case you would like to download and process multiple files at once.
You can read such files with the greenbrown R package.
Install it in R with
install.packages("greenbrown", repos="http://R-Forge.R-project.org")
If that fails because the package needs to be rebuilt by its authors, an alternative is to first download the sources directly from the repo and then install them manually, as explained in the greenbrown installation instructions. In the latter case you may also have to manually install a couple of packages that greenbrown depends on first: Kendall, bfast, and strucchange.
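A minimal sketch of that manual route (the path to the downloaded sources is a placeholder):
# install the packages greenbrown depends on first
install.packages(c("Kendall", "bfast", "strucchange"))
# then install the downloaded greenbrown sources from a local directory or tarball
install.packages("path/to/greenbrown", repos = NULL, type = "source")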
After installation, reading the raster from a URL is as easy as:
library(greenbrown)
r <- ReadVI3g("http://nasanex.s3.amazonaws.com/AVHRR/GIMMS/3G/1980s/geo81aug15a.n07-VI3g")
The object returned by greenbrown::ReadVI3g is a RasterLayer. We can display it with
plot(r)
which renders the global NDVI grid (figure omitted).

Slow bigram frequency function in R

I'm working with Twitter data and I'm currently trying to find frequencies of bigrams in which the first word is "the". I've written a function which seems to be doing what I want but is extremely slow (originally I wanted to see frequencies of all bigrams, but I gave up because of the speed). Is there a faster way to solve this problem? I've heard about the RWeka package, but I have trouble installing it; I get an error (ERROR: dependencies 'RWekajars', 'rJava' are not available for package 'RWeka')…
required libraries: tau and tcltk
bigramThe <- function(dataset, column) {
  bidata <- data.frame(x = character(0), y = numeric(0))
  pb <- tkProgressBar(title = "progress bar", min = 0, max = nrow(dataset), width = 300)
  for (i in 1:nrow(dataset)) {
    a <- column[i]
    bi <- textcnt(a, n = 2, method = "string")
    tweetbi <- data.frame(V1 = as.vector(names(bi)), V2 = as.numeric(bi))
    tweetbi$grepl <- grepl("the ", tweetbi$V1)
    tweetbi <- tweetbi[which(tweetbi$grepl == TRUE), ]
    bidata <- rbind(bidata, tweetbi)
    setTkProgressBar(pb, i, label = paste(round(i / nrow(dataset), 0), "% done"))
  }
  aggbi <- aggregate(bidata$V2, by = list(bidata$V1), FUN = sum)
  close(pb)
  return(aggbi)
}
I have almost 500,000 rows of tweets stored in a column that I pass to the function. An example dataset would look like this:
text               userid
tweet text 1            1
tweets text 2           2
the tweet text 3        3
To use RWeka, first run sudo apt-get install openjdk-6-jdk (or install/re-install your JDK via the Windows GUI), then try re-installing the package.
Should that fail, use download.file to download the source .zip file and install from source, i.e. install.packages("RWeka.zip", type = "source", repos = NULL).
If you want to speed things up without using a different package, consider using the parallel (formerly multicore) facilities and re-writing the code to use an apply function that can take advantage of parallelism.
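For example, a rough sketch with mclapply from the parallel package (the text column name is assumed from the example data above; on Windows, mc.cores must be 1):
library(parallel)   # ships with R
library(tau)

# count bigrams starting with "the" per tweet in parallel, then sum them up
counts <- mclapply(dataset$text, function(txt) {
  bi <- unclass(textcnt(txt, n = 2, method = "string"))
  bi[grepl("^the ", names(bi))]
}, mc.cores = 4)

all_counts <- unlist(unname(counts))
aggbi <- tapply(all_counts, names(all_counts), sum)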
You can get rid of the evil loop structure by collapsing the text column into one long string:
paste(dataset[[column]], collapse=" *** ")
bi<-textcnt(a, n = 2, method = "string")
I expected to also need
bi[!grepl("*", names(bi), fixed = TRUE)]
But it turns out that the textcnt method doesn't include bigrams with * in them, so you're good to go.
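Putting it together, a minimal sketch of the collapse-and-count approach (again assuming the text column from the example data):
library(tau)

# one long string instead of a per-row loop
a  <- paste(dataset$text, collapse = " *** ")
bi <- textcnt(a, n = 2, method = "string")

# keep only bigrams whose first word is "the"
res <- data.frame(bigram = names(bi), freq = as.numeric(bi))
res[grepl("^the ", res$bigram), ]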

Combining vectors of unequal size using rbind.na

I've imported some data files with an unequal number of columns and was hoping to create a data frame out of them. I've used lapply to convert them into vectors, and now I'm trying to put these vectors into a data frame.
I'm using rbind.na from the qpcR package to try to fill out the remaining elements of each vector with NA so they all become the same size. For some reason the function isn't being recognized by do.call. Can anyone figure out why this is the case?
library(plyr)
library(qpcR)
files <- list.files(path = "C:/documents", pattern = "*.txt", full.names = TRUE)
readdata <- function(x)
{
  con <- file(x, open = "rt")
  mydata <- readLines(con, warn = FALSE, encoding = "UTF-8")
  close(con)
  return(mydata)
}
all.files <- lapply(files, readdata)
combine <- do.call(rbind.na, all.files)
If anyone can think of any potential alternatives, I'm open to that too. I actually tried using a function from here, but my output didn't give me any columns.
Here is the error:
Error in do.call(rbind.na, all.files) : object 'rbind.na' not found
The package has definitely been installed too.
EDIT: changed cbind.na to rbind.na for the error.
It appears that the function is not exported by the package. Using qpcR:::rbind.na will allow you to access the function.
The triple colon allows you to access the internal variables of a namespace. Be aware though that ?":::" advises against using it in your code, presumably because objects that aren't exported can't be relied upon in future versions of a package. It suggests contacting the package maintainer to export the object if it is stable and useful.
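A quick sketch of the fix applied to the code from the question:
library(qpcR)

# call the unexported function through the package namespace
combine <- do.call(qpcR:::rbind.na, all.files)
# each shorter file is padded with NA to the length of the longest one
combine <- as.data.frame(combine, stringsAsFactors = FALSE)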
