How can you generate a large empty (zeros) numeric ffdf in R?

Let's say that I am trying to generate a large empty matrix of zeros that I can fill from the data (e.g. count data), using the ff package:
require(ff)
require(ffbase)  # there is no "ffdf" package: ffdf() comes from ff, helpers such as ffdfappend() come from ffbase
If there are 15,000 columns (variables) and 20 rows (observations), I could do the following
ffdf.object = ffdf( ff(0, dim = c(20, 15000)) )
I thought the point of ff was to load much larger datasets. For example:
> test = matrix(0, nrow = 1000000, ncol = 15000)
Error: cannot allocate vector of size 111.8 Gb
but ff runs into a similar limit: the total length of a single ff object (the product of its dimensions) cannot exceed .Machine$integer.max
> test = ff(0, dim = c(1000000, ncol = 15000))
Error in if (length < 0 || length > .Machine$integer.max) stop("length must be between 1 and .Machine$integer.max") :
missing value where TRUE/FALSE needed
In addition: Warning message:
In ff(0, dim = c(1e+06, ncol = 15000)) :
NAs introduced by coercion to integer range
Is there an easy way to create a large (e.g. 1M by 15k) ffdf in R? Alternatively, is there an easy way to make the largest possible ffdf and then rbind additional rows onto it (with working code; both rbind and ffdfappend have failed for me so far)?

You could make an SQL database. Check out the RSQLite package.
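For example, a rough sketch of what that could look like (an illustration of the suggestion, not code from the original answer). SQLite tables are limited to a couple of thousand columns by default, so with 15,000 variables it is usually easier to store the counts in long form (row, column, value) and fill cells as they are computed; unfilled cells are implicitly zero, so nothing close to a 1M x 15k dense table ever has to exist in memory:
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "counts.sqlite")
dbExecute(con, "CREATE TABLE counts (row INTEGER, col INTEGER, value REAL)")

# append chunks of counts as they are produced (made-up example values)
chunk <- data.frame(row = c(1L, 1L, 2L), col = c(10L, 250L, 14999L), value = c(3, 1, 7))
dbWriteTable(con, "counts", chunk, append = TRUE)

dbGetQuery(con, "SELECT * FROM counts WHERE row = 1")
dbDisconnect(con)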

Related

How to create a matrix with a huge number of rows

I want to create a big data frame or matrix.
Its dimensions are 49 columns and 35,886,996,700 rows.
When I try to create the matrix it gives me an error:
data <- data.frame(matrix(NA,   # create an empty data frame
                          nrow = (length(genes_union) * length(snp_union)),
                          ncol = col_length))
Error in matrix(NA, nrow = (length(genes_union) * length(snp_union)), :
invalid 'nrow' value (too large or NA)
In addition: Warning message:
In length(genes_union) * length(snp_union) :
NAs produced by integer overflow
I also tried to use big.matrix (from the bigmemory package):
z <- big.matrix(,nrow=35886996700,ncol=49)
Error in big.matrix(, nrow = 35886996700, ncol = 49) :
Error: memory could not be allocated for instance of type big.matrix
Is there any way to solve this problem so that I can create a matrix with this many rows?
Basically, my final output matrix should look like this, where G represents a gene, RS an ID, and T a tissue (row names are gene-ID pairs, columns are tissues):
        T1  T2  T3 ... Tn
G1RS1
G1RS2
G1RSn
G2RS1
G2RS2
G2RSN
GnRSn
I tried to generate a vector of 0's with length 35886996700 * 49:
x1 <- 35886996700
x1
[1] 3.5887e+10
x2 <- 49
vec1 <- rep(0, x1 * x2)
Error: cannot allocate vector of size 13101.6 Gb
I can't see any way to process/manage 13,101 GB of data. The big question is whether the matrix is extremely sparse; if so, you may be able to store the data in a much more compact sparse format. If sparse storage is feasible, see the Matrix package that ships with R: https://www.rdocumentation.org/packages/Matrix/versions/1.5-3
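For illustration, a small sketch (not from the original answer) of sparse storage with Matrix: only the non-zero cells are stored, so memory scales with the number of non-zeros rather than with nrow * ncol. Note that, as far as I know, the dimensions of a Matrix object are still limited to .Machine$integer.max, so ~35.9 billion rows would need to be split up or kept in long form.
library(Matrix)

i <- c(1, 5000, 123456)    # row indices of the non-zero cells (made-up values)
j <- c(2, 49, 17)          # column indices of the non-zero cells
x <- c(0.3, 1.2, 7.0)      # the non-zero values themselves
m <- sparseMatrix(i = i, j = j, x = x, dims = c(1e6, 49))

object.size(m)             # a few kilobytes, regardless of the nominal dimensions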

Turn data into matrix and pad with NAs

I have a list of data, which I wish to turn into a matrix. I know the exact size my matrix needs to be, but the data does not completely fill it.
For example, for a vector of length 95, I would like to turn this into a 25*4 matrix. Using the matrix command alone does not work, since the number of values does not fill the matrix exactly, so I need a way to pad it out with NAs and fill the matrix by row.
The size of matrix will be known in each scenario, but it is not consistent from one set of data to the next, so ideally, there will be a function which automatically pads the matrix with NAs if the data is not available.
Example code:
example=c(20.28671, 20.28544, 20.28416, 20.28288, 20.28161, 20.28033, 20.27906, 20.27778, 20.27651, 20.27523, 20.27396, 20.27268, 20.27141,
20.27013, 20.26885, 20.26758, 20.26533, 20.26308, 20.26083, 20.25857, 20.25632, 20.25407, 20.25182, 20.24957, 20.24732, 20.24507,
20.24282, 20.24057, 20.23832, 20.23606, 20.23381, 20.22787, 20.22193, 20.21598, 20.21004, 20.20410, 20.19816, 20.19221, 20.18627,
20.18033, 20.17438, 20.16844, 20.16250, 20.15656, 20.15061, 20.14467, 20.13527, 20.12587, 20.11646, 20.10706, 20.09766, 20.08826,
20.07886, 20.06946, 20.06005, 20.05065, 20.04125, 20.03185, 20.02245, 20.01305, 20.00364, 20.00369, 20.00374, 20.00378, 20.00383,
20.00388, 20.00392, 20.00397, 20.00401, 20.00406, 20.00411, 20.00415, 20.00420, 20.00425, 20.00429, 20.00434, 20.01107, 20.01779,
20.02452, 20.03125, 20.03798, 20.04470, 20.05143, 20.05816, 20.06489, 20.07161, 20.07834, 20.08507, 20.09180, 20.09853, 20.10525,
20.11359, 20.12193, 20.13026, 20.13860)
mat=matrix(example,ncol=4,nrow=25)
Warning message:
In matrix(example, ncol = 4, nrow = 25) :
data length [95] is not a sub-multiple or multiple of the number of rows [25]
Whilst I'm sure this is not the best answer, it does achieve what you want:
If you subset a vector with [ using indices that are beyond its length, it will pad with NA:
mat = matrix(example[1:100],nrow = 25, byrow = TRUE, ncol = 4)
This feels as though it is a bit messy though. Perhaps one of the others is better R code.
You can try this:
mat <- matrix(NA,ncol=4, nrow=25)
mat[1:length(example)] <- example
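Note that this fills the matrix column by column rather than by row as the question asks; a small variation (sketch only, not part of the original answer) is to fill the transposed shape and transpose back:
mat <- matrix(NA, nrow = 4, ncol = 25)   # transposed shape
mat[1:length(example)] <- example        # column-wise fill of the 4 x 25 matrix
mat <- t(mat)                            # 25 x 4 matrix, effectively filled by row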
We can use length<- to pad with NAs up to the desired length when the vector falls short, and then call matrix.
nC <- 4
nR <- 25
matrix(`length<-`(example, nC*nR), nR, nC)
The length<- option can also be used in several other cases, e.g. in a list of vectors whose lengths are not equal. In that case, we pad with NAs when we need to convert to a data.frame or matrix.
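For instance, a small sketch (with made-up vectors) of padding a ragged list to a common length before binding it into a matrix:
vecs <- list(a = 1:3, b = 1:5, c = 1:2)   # made-up example data
n <- max(lengths(vecs))                   # the longest vector sets the width
padded <- lapply(vecs, `length<-`, n)     # shorter vectors are padded with NA
do.call(rbind, padded)                    # 3 x 5 matrix, NA-padded by row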

Issues with nested while loop in for loop for R

I am using R to code simulations for a research project I am conducting in college. After creating relevant data structures and generating data, I seek to randomly modify a proportion P of observations (in increments of 0.02) in a 20 x 20 matrix by some effect K. In order to randomly determine the observations to be modified, I sample a number of integers equal to P*400 twice to represent row (rRow) and column (rCol) indices. In order to guarantee that no observation will be modified more than once, I perform this algorithm:
I create a matrix, alrdyModded, that is 20 x 20 and initialized to 0s.
I take the first value in rRow and rCol and check whether alrdyModded[rRow[1]][rCol[1]]==1; while alrdyModded[rRow[1]][rCol[1]]==1, I randomly select new integers for the indices until it equals 0
When alrdyModded[rRow[1]][rCol[1]]==0, modify the value in a treatment matrix with same indices and change alrdyModded[rRow[1]][rCol[1]] to 1
Repeat for the entire length of rRow and rCol vectors
I believe a good method to perform this operation is a while loop nested in a for loop. However, when I enter the code below into R, I receive the following error:
R CODE:
propModded<-1.0
trtSize<-2
numModded<-propModded*400
trt1<- matrix(rnorm(400,0,1),nrow = 20, ncol = 20)
cont<- matrix(rnorm(400,0,1),nrow = 20, ncol = 20)
alrdyModded1<- matrix(0, nrow = 20, ncol = 20)
## data structures for computation have been intitialized and filled
rCol<-sample.int(20,numModded,replace = TRUE)
rRow<-sample.int(20,numModded,replace = TRUE)
## indices for modifying observations have been generated
for(b in 1:numModded){
while(alrdyModded1[rRow[b]][rCol[b]]==1){
rRow[b]<-sample.int(20,1)
rCol[b]<-sample.int(20,1)}
trt1[rRow[b]][rCol[b]]<-'+'(trt1[rRow[b]][rCol[b]],trtSize)
alrdyModded[rRow[b]][rCol[b]]<-1
}
## algorithm for guaranteeing no observation in trt1 is modified more than once
R OUTPUT
" Error in while (alrdyModded1[rRow[b]][rCol[b]] == 1) { :
missing value where TRUE/FALSE needed "
When I take out the for loop and run the code, the while loop evaluates the statement just fine, which implies an issue with accessing the correct values from the rRow and rCol vectors. I would appreciate any help in resolving this problem.
It appears you're not indexing right within the matrix. Instead of having a condition like while(alrdyModded1[rRow[b]][rCol[b]]==1){, it should read like this: while(alrdyModded1[rRow[b], rCol[b]]==1){. Matrices are indexed like this: matrix[1, 1], and it looks like you're forgetting your commas. The for-loop should be something closer to this:
for(b in 1:numModded){
while(alrdyModded1[rRow[b], rCol[b]]==1){
rRow[b]<-sample.int(20,1)
rCol[b]<-sample.int(20,1)}
trt1[rRow[b], rCol[b]]<-'+'(trt1[rRow[b], rCol[b]],trtSize)
alrdyModded1[rRow[b], rCol[b]]<-1
}
On a side note, why not make alrdyModded1 a boolean matrix (populated with just TRUE and FALSE values) with alrdyModded1<- matrix(FALSE, nrow = 20, ncol = 20) in line 7, and have the condition be just while(alrdyModded1[rRow[b], rCol[b]]){ instead?
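Putting the two suggestions together, a sketch of how the corrected loop might look with a logical matrix (same variable names as in the question, untested against the original simulation):
alrdyModded1 <- matrix(FALSE, nrow = 20, ncol = 20)   # TRUE = cell already modified

for (b in 1:numModded) {
  while (alrdyModded1[rRow[b], rCol[b]]) {            # resample until an untouched cell is found
    rRow[b] <- sample.int(20, 1)
    rCol[b] <- sample.int(20, 1)
  }
  trt1[rRow[b], rCol[b]] <- trt1[rRow[b], rCol[b]] + trtSize
  alrdyModded1[rRow[b], rCol[b]] <- TRUE              # mark the cell as used
}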

Convert an ff object to a data.frame

I am working with a big matrix and the ff package.
I am loading an ff object and I want to use it to calculate a CRPS (a score).
For example, I have an ff_matrix (called Mat, with 25 rows and 7303 columns) which is a precipitation forecast (7303 is the number of days, about 20 years, and 25 are the 25 precipitation simulations for one day). I also have an ff_array with the observations for these 20 years (called Obs, with 7303 values).
With the package ensembleBMA I want to calculate the CRPS. I need to put my ff_matrix and my ff_array in an "ensembleBMA" object (in fact this is a data.frame).
For this code:
ensembleBMA(Mat,Obs)
I have this error:
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : cannot coerce class 'c("ff_matrix", "ff_array", "ff")' into a data.frame
I tried different options such as:
as.data.frame(Mat)
as.matrix(Mat)
transform.ffdf(as.ffdf(Mat))
I always have these errors:
Error in as.data.frame.default(Mat_Ptot_212_1) : cannot automatically convert class 'c("ff_matrix", "ff_array", "ff")' into a data frame (data.frame)
or
opening ff /tmp/RtmpWrlY4n/clone9d3376b435.ff Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : write error
Does anyone have an idea?
One way is to first convert your ff_array to an array and convert that to a data.frame:
Mat <- ff(1, vmode="double", dim=c(25, 7303))
as.data.frame(Mat[,])
Or first convert your ff_array to an ffdf and convert that to a data.frame:
as.ffdf(Mat)[,]
or
as.data.frame(as.ffdf(Mat))
The last two solutions seem to be much slower than the first. This probably has to do with the large number of columns, which slows down as.ffdf because it has to create 7303 files.
There does not seem to be an as.data.frame.ff_array method.
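Applied to the objects in the question, a rough sketch (assuming Mat and Obs are laid out as described; the exact shape ensembleBMA expects is left aside) would pull both ff objects into ordinary in-memory objects first:
Mat_mem <- Mat[, ]            # [,] / [] extraction drops the ff classes: 25 x 7303 plain matrix
Obs_mem <- Obs[]              # length-7303 plain numeric vector
fc <- data.frame(t(Mat_mem))  # one row per day, one column per ensemble member
fc$obs <- Obs_mem             # attach the observations before building the ensembleBMA input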

Doing calculations on dataframe from ffdf object

I'm working with a large dataset (3.5M lines and 40 columns) and I need to clean out some values so I'll be able to calculate other parameters that are necessary when I start formulating a model around the data.
The problem is that it is taking forever to apply the for loops I have been using, so I wanted to try the ff package. The data frame is called data and it consists of a bunch of customer information for a bank; it was imported from a .csv file. What I need to do is remove all customers (labeled Serial) whose AverageStanding variable is ever negative.
> ffd<-as.ffdf(data)
> lastserial = tail(ffd$Serial,1)
> for(k in 1:lastserial){
+ tempvecWith <- vector()
+ tempvecWith <- ffd[ffd$Serial==k, ]$AverageStanding
+ if(any(tempvecWith < 0)){
+ ffd_clean<- ffd[!ffd$Serial ==k, ]
+ }
+ }
This is the error that I am receiving:
Error in as.hi.integer(x, maxindex = maxindex, dim = dim, vw = vw, pack = pack) :
NAs in as.hi.integer
Any ideas on how I can avoid these errors?
The error comes from this part of your code ffd[ffd$Serial==k, ]. Namely ffd$Serial==k returns an ff logical vector. But if you want to index or subset an ff vector or ffdf, you need to supply the index numbers, not a vector of logicals. You can turn your ff vector of logicals into an ff vector of index numbers by using ffwhich from package ffbase.
So for your question, I believe you are looking for this kind of code (not tested, as you did not supply any data).
require(ffbase)

# flag the rows where AverageStanding is negative and turn the logical
# ff vector into an ff vector of row numbers with ffwhich
idx <- ffd$AverageStanding < 0
idx <- ffwhich(idx, idx==TRUE)
open(ffd)

# the Serial numbers of all customers that ever had a negative AverageStanding
serials.with.negative <- ffd$Serial[idx]
serials.with.negative <- unique(serials.with.negative)

# mark every row belonging to one of those customers, then keep the other rows
ffd$is.customer.with.negative.avgstanding <- ffd$Serial %in% serials.with.negative
idx <- ffd$is.customer.with.negative.avgstanding == FALSE
idx <- ffwhich(idx, idx==TRUE)
open(ffd)
ffd_clean <- ffd[idx, ]
