How to create a matrix with a huge number of rows in R

I want to create a big data frame or matrix with 49 columns and 35886996700 rows.
When I try to create the matrix, I get an error:
data <- data.frame(matrix(NA, # Create empty data frame
                          nrow = (length(genes_union)*length(snp_union)),
                          ncol = col_length))
Error in matrix(NA, nrow = (length(genes_union) * length(snp_union)), :
invalid 'nrow' value (too large or NA)
In addition: Warning message:
In length(genes_union) * length(snp_union) :
NAs produced by integer overflow
I also tried to use big.matrix
z <- big.matrix(,nrow=35886996700,ncol=49)
Error in big.matrix(, nrow = 35886996700, ncol = 49) :
Error: memory could not be allocated for instance of type big.matrix
Is there any way to solve this problem so that I can create a matrix with this many rows?
Basically, my final output matrix should look like this:
G represents genes, RS represents IDs, and T represents different tissues.
T1 T2 T3 ...Tn
G1RS1
G1RS2
G1RSn
G2RS1
G2RS2
G2RSN
GnRSn

I tried to generate a vector of 0's with length 35886996700 * 49:
x1 <- 35886996700
x1
[1] 3.5887e+10
x2 <- 49
vec1 <- rep(0, x1 * x2)
Error: cannot allocate vector of size 13101.6 Gb
I can't see any way to process or manage 13,101 GB of data. The big question is whether the matrix is extremely sparse. If it is, you may be able to store the data in a much more compact sparse format. If sparse storage is feasible, see the Matrix package, which ships with R as a recommended package (it is not part of base R itself): https://www.rdocumentation.org/packages/Matrix/versions/1.5-3
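As a rough sketch of both points (the counts n_genes and n_snps and the sizes n_small and k are made up for illustration, and the small matrix merely stands in for the real data): doing the size arithmetic in doubles avoids the integer overflow warning from the question, and sparseMatrix() from the Matrix package stores only the non-zero cells. As far as I know, a single Matrix object still cannot have more than .Machine$integer.max rows, so 35886996700 rows would have to be split into blocks even in sparse form.

```r
library(Matrix)

# Integer multiplication overflows past .Machine$integer.max (2147483647)
# and yields NA, which matrix() then rejects as an invalid 'nrow'.
n_genes <- 200000L   # hypothetical counts, not the asker's real ones
n_snps  <- 200000L
overflowed <- suppressWarnings(n_genes * n_snps)   # NA
safe_rows  <- as.numeric(n_genes) * n_snps         # 4e10, stored as a double

# Dense storage needed for the matrix in the question, 8 bytes per double
dense_gib <- 35886996700 * 49 * 8 / 2^30           # about 13101.6

# Sparse storage keeps only the non-zero cells (toy dimensions)
set.seed(1)
n_small <- 1e6
k <- 1000
i <- sample.int(n_small, k)              # distinct rows, so no duplicate cells
j <- sample.int(49, k, replace = TRUE)
sp <- sparseMatrix(i = i, j = j, x = rep(1, k), dims = c(n_small, 49))
print(object.size(sp), units = "Kb")
```

Whether this helps depends entirely on how many of the gene/ID/tissue cells are actually non-zero.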

Related

Loop through a character vector to use in a function

I am conducting a method comparison study, comparing measurements from two different systems. My dataset has a large number of columns with variables containing measurements from one of the two systems.
aX and bX are both measures of X, but from systems a and b respectively. I have about 80 pairs of variables like this.
A simplified version of my data looks like this:
set.seed(1)
df <- data.frame(
ID = as.factor(rep(1:2, each=10)),
aX = rep(1:10+rnorm(10,mean=1,sd=0.5),2),
bX = rep(1:10+rnorm(10,mean=1,sd=0.5),2),
aY = rep(1:10+rnorm(10,mean=1,sd=0.5), 2),
bY = rep(1:10-rnorm(10,mean=1,sd=0.5),2))
head(df)
ID aX bX aY bY
1 1 1.686773 2.755891 2.459489 -0.6793398
2 1 3.091822 3.194922 3.391068 1.0513939
3 1 3.582186 3.689380 4.037282 1.8061642
4 1 5.797640 3.892650 4.005324 3.0269025
5 1 6.164754 6.562465 6.309913 4.6885298
6 1 6.589766 6.977533 6.971936 5.2074973
I am trying to loop through the elements of a character vector, and use the elements to point to columns in the dataframe. But I keep getting error messages when I try to call functions with variable names generated in the loop.
For simplicity, I have changed the loop to include a linear model as this produces the same type of error as I have in my original script.
#This line is only included to show that
#the formula used in the loop works when
#called directly with the "real" column names
(broom::glance(lm(aX~bX, data = df)))$r.squared
[1] 0.9405218
#Now I try the loop
varlist <- c("X", "Y")
for(i in 1:length(varlist)){
  aVAR <- paste0("a", varlist[i])
  bVAR <- paste0("b", varlist[i])
  #aVAR and bVAR appear to hold names identical to column names in the df dataframe
  print(c(aVAR, bVAR))
  #Try the formula with the loop variable names
  print((broom::glance(lm(aVAR~bVAR, data = df)))$r.squared)
}
The error messages I get when calling the functions from inside the loop vary according to the function I am calling. The common denominator is that they occur when I try to use the character vector (varlist) to pick out specific columns.
Example of error messages:
rmcorr(ID, aVAR, bVAR, df)
Error in rmcorr(ID, aVAR, bVAR, df) :
'Measure 1' and 'Measure 2' must be numeric
or
broom::glance(lm(aVAR~bVAR, data = df))
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
In storage.mode(v) <- "double" : NAs introduced by coercion
Can you help me understand what goes wrong in the loop? Or suggest and show another way to accomplish what I am trying to do?
Variables aren't evaluated in formulas (the things with ~).
You can type
bert ~ ernie
and not get an error even if variables named bert and ernie do not exist. A formula stores relationships between symbols/names and does not attempt to evaluate them. Also note that we are not using quotes here. Variable names (or symbols) are not interchangeable with character values (i.e. aX is very different from "aX").
So when putting together a formula from string values, I suggest you use the reformulate() function. It takes a vector of names for the right-hand side and an optional value for the left-hand side. So you would create the same formula with
reformulate("ernie", "bert")
# bert ~ ernie
And you can use that with your lm call:
lm(reformulate(bVAR, aVAR), data = df)
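Put together, the whole loop might look something like the sketch below. It rebuilds the example data from the question, uses summary() instead of broom::glance() only to avoid an extra dependency, and collects the results in a named vector (r2 is a name chosen here for illustration). For functions that want the column values rather than a formula, df[[name]] extracts a column by its character name.

```r
# Example data as in the question
set.seed(1)
df <- data.frame(
  ID = as.factor(rep(1:2, each = 10)),
  aX = rep(1:10 + rnorm(10, mean = 1, sd = 0.5), 2),
  bX = rep(1:10 + rnorm(10, mean = 1, sd = 0.5), 2),
  aY = rep(1:10 + rnorm(10, mean = 1, sd = 0.5), 2),
  bY = rep(1:10 - rnorm(10, mean = 1, sd = 0.5), 2)
)

varlist <- c("X", "Y")

# One R-squared per variable pair, named after the entries of varlist
r2 <- vapply(varlist, function(v) {
  a_name <- paste0("a", v)
  b_name <- paste0("b", v)
  # df[[name]] looks a column up by its character name
  stopifnot(is.numeric(df[[a_name]]), is.numeric(df[[b_name]]))
  # Build the formula a ~ b from the two strings
  f <- reformulate(b_name, response = a_name)
  summary(lm(f, data = df))$r.squared
}, numeric(1))

print(r2)
```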
I'm too lazy to search for a duplicate on how to construct formulas programmatically, so here is a solution:
varlist <- c("X", "Y")
for(i in seq_along(varlist)){
  #make these symbols:
  aVAR <- as.symbol(paste0("a", varlist[i]))
  bVAR <- as.symbol(paste0("b", varlist[i]))
  #aVAR and bVAR are now symbols matching column names in the df dataframe
  print(c(aVAR, bVAR))
  #Try the formula with the loop variable names
  #construct the call to `lm` with `bquote` and `eval` the expression
  print((broom::glance(eval(bquote(lm(.(aVAR) ~ .(bVAR), data = df)))))$r.squared)
}

How can you generate a large empty (zeros) numeric ffdf in R?

Let's say that I am trying to generate a large empty matrix of zeros that I can fill from the data (e.g. count data) in the package ff
require(ff)
require(ffdf)
If there are 15,000 columns (variables) and 20 rows (observations), I could do the following
ffdf.object = ffdf( ff(0, dim = c(20, 15000)) )
I thought the point of ff was to load much larger datasets. For example:
> test = matrix(0, nrow = 1000000, ncol = 15000)
Error: cannot allocate vector of size 111.8 Gb
But ff gives roughly the same problem: the total number of elements in the matrix cannot be larger than .Machine$integer.max
> test = ff(0, dim = c(1000000, ncol = 15000))
Error in if (length < 0 || length > .Machine$integer.max) stop("length must be between 1 and .Machine$integer.max") :
missing value where TRUE/FALSE needed
In addition: Warning message:
In ff(0, dim = c(1e+06, ncol = 15000)) :
NAs introduced by coercion to integer range
Is there an easy way to create a large (e.g. 1M by 15k) ffdf in R? Alternatively, is there an easy way to make the largest possible matrix ffdf and then rbind additional rows? Working code would help, since neither rbind nor ffdfappend has worked for me so far.
You could make an SQL database. Check out the RSQLite package.
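A minimal sketch of that idea, assuming the DBI and RSQLite packages are installed. The table name counts and its columns obs, var and n are hypothetical, and the data is stored in long format (one row per non-zero count) rather than as a 1M by 15k grid. The first lines just confirm why the ff call fails: the element count is larger than .Machine$integer.max.

```r
library(DBI)

# 1e6 * 15000 elements is more than a single ff vector (or R integer) can index
n_elements <- 1e6 * 15000
too_big <- n_elements > .Machine$integer.max   # TRUE

# In-memory database for the sketch; pass a file path instead to keep it on disk
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE counts (obs INTEGER, var INTEGER, n INTEGER)")

# Append data chunk by chunk; only non-zero counts need to be stored
chunk <- data.frame(obs = c(1L, 1L, 2L), var = c(10L, 42L, 42L), n = c(3L, 1L, 7L))
dbWriteTable(con, "counts", chunk, append = TRUE)

# Aggregate inside the database instead of loading everything into R
totals <- dbGetQuery(con,
  "SELECT var, SUM(n) AS total FROM counts GROUP BY var ORDER BY var")
print(totals)

dbDisconnect(con)
```

Appending chunks this way also sidesteps the rbind problem from the question, since the table simply grows on disk.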

Turn data into matrix and pad with NAs

I have a list of data, which I wish to turn into a matrix. I know the exact size my matrix needs to be, but the data does not completely fill it.
For example, for a vector of length 95, I would like to turn this into a 25*4 matrix. Using the matrix command does not work since the number of values does not fit the matrix so I need a way to pad this out with NAs, and fill the matrix by row.
The size of matrix will be known in each scenario, but it is not consistent from one set of data to the next, so ideally, there will be a function which automatically pads the matrix with NAs if the data is not available.
Example code:
example=c(20.28671, 20.28544, 20.28416, 20.28288, 20.28161, 20.28033, 20.27906, 20.27778, 20.27651, 20.27523, 20.27396, 20.27268, 20.27141,
20.27013, 20.26885, 20.26758, 20.26533, 20.26308, 20.26083, 20.25857, 20.25632, 20.25407, 20.25182, 20.24957, 20.24732, 20.24507,
20.24282, 20.24057, 20.23832, 20.23606, 20.23381, 20.22787, 20.22193, 20.21598, 20.21004, 20.20410, 20.19816, 20.19221, 20.18627,
20.18033, 20.17438, 20.16844, 20.16250, 20.15656, 20.15061, 20.14467, 20.13527, 20.12587, 20.11646, 20.10706, 20.09766, 20.08826,
20.07886, 20.06946, 20.06005, 20.05065, 20.04125, 20.03185, 20.02245, 20.01305, 20.00364, 20.00369, 20.00374, 20.00378, 20.00383,
20.00388, 20.00392, 20.00397, 20.00401, 20.00406, 20.00411, 20.00415, 20.00420, 20.00425, 20.00429, 20.00434, 20.01107, 20.01779,
20.02452, 20.03125, 20.03798, 20.04470, 20.05143, 20.05816, 20.06489, 20.07161, 20.07834, 20.08507, 20.09180, 20.09853, 20.10525,
20.11359, 20.12193, 20.13026, 20.13860)
mat=matrix(example,ncol=4,nrow=25)
Warning message:
In matrix(example, ncol = 4, nrow = 25) :
data length [95] is not a sub-multiple or multiple of the number of rows [25]
Whilst I'm sure this is not the best answer, it does achieve what you want.
If you subset a vector with [ using indices beyond its length, the result is padded with NA:
mat = matrix(example[1:100],nrow = 25, byrow = TRUE, ncol = 4)
This feels a bit messy, though. Perhaps one of the other answers is better R code.
You can try this:
mat <- matrix(NA,ncol=4, nrow=25)
mat[1:length(example)] <- example
We can use length<- to pad the vector with NAs up to the desired length if it is too short, and then call matrix.
nC <- 4
nR <- 25
matrix(`length<-`(example, nC*nR), nR, nC)
The length<- option can also be used in several other cases, e.g. for a list of vectors whose lengths are not equal. In that case, we pad with NAs before converting to a data.frame or matrix.
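Since the question asks for a reusable function that fills by row, the length<- idea could be wrapped up roughly as follows. pad_matrix is a name invented here, not an existing function, and the short vector v stands in for the 95-value example.

```r
# Pad x with NA up to nrow * ncol elements, then shape it into a matrix.
# byrow = TRUE fills row by row, as requested in the question.
pad_matrix <- function(x, nrow, ncol, byrow = TRUE) {
  stopifnot(length(x) <= nrow * ncol)
  length(x) <- nrow * ncol   # extending a vector pads it with NA
  matrix(x, nrow = nrow, ncol = ncol, byrow = byrow)
}

v <- seq_len(95)             # stand-in for the 95 values in the question
m <- pad_matrix(v, nrow = 25, ncol = 4)
tail(m, 2)                   # the last rows show where the NA padding lands
```

Note that without byrow = TRUE, matrix() fills column by column, so the padding would end up at the bottom of the last column instead of at the end of the last rows.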

Having trouble handling large data in R

I'm currently building a recommender system with 8k users and 200k items using the recommenderlab package.
Before I can use the functions of recommenderlab, I'm having trouble converting my data frame to a realRatingMatrix.
item_idx mem_idx rating
1 00600015987465341234f7dae4 534122168382b 4
2 0060001660924533ad0cd443e1 53d79f413e3aa 5
3 006000195520453d7ac28e4b4b 53d79f413e3aa 5
4 0060001986642536d6fc77d269 535146eb5af95 4
5 00708969975005409278f828f3 540927366f478 5
This is part of my data frame. All the (item_idx, mem_idx) pairs are distinct.
mat <- tapply(df$rating, list(df$mem_idx, df$ID), FUN=function(x) x)
I tried to convert the data frame to a matrix using this code. It sometimes succeeds, but usually fails with an error like this:
Error: cannot allocate vector of size 1.1 Gb
When it does succeed, I apply this code to convert the result to a realRatingMatrix:
r <- as(mat, "realRatingMatrix")
But that always fails with this error:
Error in which(x == 0, arr.ind = TRUE) :
error in evaluating the argument 'x' in selecting a method for function 'which': Error: (list) object cannot be coerced to type 'double'
If anyone knows how to avoid either of these errors, please help me.
Convert the data frame to a sparse matrix and then to the realRatingMatrix class:
library(Matrix)
library(recommenderlab)
itm <- factor(data[,1])
mem <- factor(data[,2])
# sparse matrix; recommenderlab expects users in rows and items in columns
s <- sparseMatrix(
  i = as.integer(mem),
  j = as.integer(itm),
  x = data[,3],
  dimnames = list(
    levels(mem),
    levels(itm)))
# convert to realRatingMatrix class
r <- new("realRatingMatrix", data = s)
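To see why the sparse route avoids the allocation error, here is a small self-contained sketch that needs only the Matrix package. The data frame ratings and its values are toy data invented for illustration, with users placed in rows and items in columns.

```r
library(Matrix)

# A dense 8k x 200k matrix of doubles would need about 11.9 GiB
dense_gib <- 8000 * 200000 * 8 / 2^30

# Toy long-format ratings: one row per (item, member) pair
ratings <- data.frame(
  item_idx = c("itemA", "itemB", "itemC", "itemA"),
  mem_idx  = c("u1", "u2", "u2", "u3"),
  rating   = c(4, 5, 5, 3),
  stringsAsFactors = FALSE
)

# Factor codes turn the string IDs into row and column positions
mem <- factor(ratings$mem_idx)
itm <- factor(ratings$item_idx)

s <- sparseMatrix(
  i = as.integer(mem),
  j = as.integer(itm),
  x = ratings$rating,
  dimnames = list(levels(mem), levels(itm))
)

s["u2", "itemB"]   # look up a single rating by name
length(s@x)        # only the observed ratings are stored
```

The stored size grows with the number of observed ratings, not with users times items, which is what makes 8k by 200k manageable.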

Issues with nested while loop in for loop for R

I am using R to code simulations for a research project I am conducting in college. After creating relevant data structures and generating data, I seek to randomly modify a proportion P of observations (in increments of 0.02) in a 20 x 20 matrix by some effect K. In order to randomly determine the observations to be modified, I sample a number of integers equal to P*400 twice to represent row (rRow) and column (rCol) indices. In order to guarantee that no observation will be modified more than once, I perform this algorithm:
1. I create a matrix, alrdyModded, that is 20 x 20 and initialized to 0s.
2. I take the first value in rRow and rCol, and check whether alrdyModded[rRow[1]][rCol[1]]==1; while alrdyModded[rRow[1]][rCol[1]]==1, I randomly select new integers for the indices until it ==0.
3. When alrdyModded[rRow[1]][rCol[1]]==0, I modify the value in a treatment matrix with the same indices and change alrdyModded[rRow[1]][rCol[1]] to 1.
4. I repeat this for the entire length of the rRow and rCol vectors.
I believe a good method to perform this operation is a while loop nested in a for loop. However, when I enter the code below into R, I receive the following error code:
R CODE:
propModded<-1.0
trtSize<-2
numModded<-propModded*400
trt1<- matrix(rnorm(400,0,1),nrow = 20, ncol = 20)
cont<- matrix(rnorm(400,0,1),nrow = 20, ncol = 20)
alrdyModded1<- matrix(0, nrow = 20, ncol = 20)
## data structures for computation have been intitialized and filled
rCol<-sample.int(20,numModded,replace = TRUE)
rRow<-sample.int(20,numModded,replace = TRUE)
## indices for modifying observations have been generated
for(b in 1:numModded){
while(alrdyModded1[rRow[b]][rCol[b]]==1){
rRow[b]<-sample.int(20,1)
rCol[b]<-sample.int(20,1)}
trt1[rRow[b]][rCol[b]]<-'+'(trt1[rRow[b]][rCol[b]],trtSize)
alrdyModded[rRow[b]][rCol[b]]<-1
}
## algorithm for guaranteeing no observation in trt1 is modified more than once
R OUTPUT
" Error in while (alrdyModded1[rRow[b]][rCol[b]] == 1) { :
missing value where TRUE/FALSE needed "
When I take out the for loop and run the code, the while loop evaluates the statement just fine, which implies an issue with accessing the correct values from the rRow and rCol vectors. I would appreciate any help in resolving this problem.
It appears you're not indexing right within the matrix. Instead of having a condition like while(alrdyModded1[rRow[b]][rCol[b]]==1){, it should read like this: while(alrdyModded1[rRow[b], rCol[b]]==1){. Matrices are indexed like this: matrix[1, 1], and it looks like you're forgetting your commas. The for-loop should be something closer to this:
for(b in 1:numModded){
  while(alrdyModded1[rRow[b], rCol[b]]==1){
    rRow[b]<-sample.int(20,1)
    rCol[b]<-sample.int(20,1)}
  trt1[rRow[b], rCol[b]]<-'+'(trt1[rRow[b], rCol[b]],trtSize)
  alrdyModded1[rRow[b], rCol[b]]<-1
}
On a side note, why not make alrdyModded1 a boolean matrix (populated with just TRUE and FALSE values) with alrdyModded1<- matrix(FALSE, nrow = 20, ncol = 20) in line 7, and have the condition be just while(alrdyModded1[rRow[b], rCol[b]]){ instead?
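To see where the NA in the error comes from: m[i] picks a single element by linear index, and indexing that length-one result with [j] for any j greater than 1 runs off its end and gives NA. The sketch below shows that, and then an alternative to the retry loop, which is a different technique from the one in the question: sampling cell positions without replacement guarantees that no cell is chosen twice. The names prop_modded, trt_size and trt are illustrative stand-ins for the question's variables.

```r
set.seed(42)

m <- matrix(0, nrow = 20, ncol = 20)
chained <- m[3][5]   # m[3] has length 1, so element 5 of it is NA
proper  <- m[3, 5]   # row 3, column 5

prop_modded <- 0.5
trt_size <- 2
trt <- matrix(rnorm(400, 0, 1), nrow = 20, ncol = 20)
before <- trt

# Draw distinct linear indices into the matrix; no bookkeeping matrix needed
n_mod <- round(prop_modded * length(trt))
idx <- sample.int(length(trt), size = n_mod)
trt[idx] <- trt[idx] + trt_size

# If row/column pairs are needed, arrayInd() recovers them from linear indices
rc <- arrayInd(idx, dim(trt))
head(rc)
```

With sample.int() and the default replace = FALSE, the while loop is no longer needed even when the proportion is 1.0.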
