Having trouble handling large data in R

I'm currently building a recommender system with 8k users and 200k items using the recommenderlab package.
Before I can use the recommenderlab functions, I'm having trouble converting my data frame to a realRatingMatrix.
item_idx mem_idx rating
1 00600015987465341234f7dae4 534122168382b 4
2 0060001660924533ad0cd443e1 53d79f413e3aa 5
3 006000195520453d7ac28e4b4b 53d79f413e3aa 5
4 0060001986642536d6fc77d269 535146eb5af95 4
5 00708969975005409278f828f3 540927366f478 5
This is part of my data frame; all the (item_idx, mem_idx) pairs are distinct.
mat <- tapply(df$rating, list(df$mem_idx, df$ID), FUN=function(x) x)
I tried to convert the data frame to a matrix using this code. It sometimes succeeds, but usually an error like this occurs:
Error: cannot allocate vector of size 1.1 Gb
In the cases where it succeeded,
r <- as(mat, "realRatingMatrix")
I applied this code to turn it into a realRatingMatrix,
but it always failed with this error:
Error in which(x == 0, arr.ind = TRUE) :
error in evaluating the argument 'x' in selecting a method for function 'which': Error: (list) object cannot be coerced to type 'double'
If anyone knows how to get past either of these errors, please help me.

Convert the data frame to a sparse matrix and then to the realRatingMatrix class:
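library(Matrix)          # provides sparseMatrix()
library(recommenderlab)  # provides the realRatingMatrix class
itm <- factor(data[,1])  # item_idx
mem <- factor(data[,2])  # mem_idx
# sparse matrix: only the observed ratings are stored
# recommenderlab treats rows as users and columns as items
s <- sparseMatrix(
  i = as.integer(mem),
  j = as.integer(itm),
  x = data[,3],
  dimnames = list(levels(mem), levels(itm)))
# convert to realRatingMatrix class
r <- new("realRatingMatrix", data = s)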
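To see why the sparse route avoids the allocation error: a dense 8k x 200k matrix has roughly 1.6 billion cells, while a sparse matrix stores only the ratings that actually exist. Here is a minimal sketch on a made-up four-row data frame (the ids and ratings are invented for illustration, not taken from the question), using only the Matrix package. The final conversion is left as a comment because it needs recommenderlab installed.

```r
library(Matrix)

# Toy stand-in for the question's data frame (hypothetical values)
ratings <- data.frame(
  item_idx = c("i1", "i2", "i3", "i1"),
  mem_idx  = c("u1", "u2", "u2", "u3"),
  rating   = c(4, 5, 5, 3),
  stringsAsFactors = FALSE
)

usr <- factor(ratings$mem_idx)
itm <- factor(ratings$item_idx)

# Users as rows, items as columns; only the 4 observed ratings are stored
s <- sparseMatrix(
  i = as.integer(usr),
  j = as.integer(itm),
  x = ratings$rating,
  dims = c(nlevels(usr), nlevels(itm)),
  dimnames = list(levels(usr), levels(itm))
)

dense_cells  <- prod(dim(s))   # cells a dense matrix would allocate
stored_cells <- length(s@x)    # values the sparse matrix actually holds

# With recommenderlab loaded, the conversion from the answer would be:
# r <- new("realRatingMatrix", data = s)
```

Cells without a rating read back as 0 here, which recommenderlab interprets as "not rated" rather than as a rating of zero.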

Related

Error: incorrect number of dimensions & How to make an object that is the sum of the 5th column of the matrix contained in the object "myList"

I am new to coding and I am having problems with a homework question (part 1.3): "make an object called 'ans2' that is the sum of the 5th column of the matrix contained in "myList"". I keep getting different variations of errors whenever I try to run it.
The question also specifically states: *Use [], [[]], [,] notation to answer this part
First, for context, this is my work for the previous parts of this problem, which all ran without error:
#1.1
# Question: Make an object called `myList` that contains the following elements (in order):
# * The integers 1 through 10
# * A matrix with the integers 1 through 25 that has five rows, filled by row
# * A list that contains
#   - The text "latte"
#   - A vector with the text "taco" and "nan"
#   - A vector with the integers 1 through 7
myList <- c(1:10)
matrix(c(1:25), byrow = TRUE, nrow=5, )
list( str = "latte", vec = "taco", "nan", vec = c(1:7))
#1.2
# Question: Make an object called `ans1` that is the fourth row of the matrix contained in `myList`
ans1 <- myList[4]
I tried doing this for this question:
ans2 <- colSums(myList[,c(5)], na.rm=TRUE )
But I got the error:
Error in myList[ , c(5)] : incorrect number of dimensions
Then I looked on some of the exchanges on this site and tried,
ans2 <- myList[1:nrow(myList),5]
But I got another new error:
Error in 1:nrow(myList) : argument of length 0
Would really appreciate help on how to solve this problem!
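Both errors are consistent with myList being just the integer vector 1:10: the matrix and the inner list in part 1.1 are created but never stored anywhere, so there is nothing two-dimensional to index. A possible sketch, assuming the homework wants all three elements kept inside a single list in the stated order:

```r
# Store all three elements inside one list
myList <- list(
  1:10,
  matrix(1:25, nrow = 5, byrow = TRUE),
  list("latte", c("taco", "nan"), 1:7)
)

# [[2]] extracts the matrix itself; [2] would return a list of length 1
ans1 <- myList[[2]][4, ]        # fourth row of the matrix
ans2 <- sum(myList[[2]][, 5])   # sum of the fifth column
```

With the matrix filled by row, the fifth column is 5, 10, 15, 20, 25, so ans2 comes out as 75.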

How to create a matrix with huge number of rows

I want to create a big data frame or matrix.
Its dimensions are 49 columns by 35886996700 rows.
When I try to create the matrix, I get an error:
data <- data.frame(matrix(NA, # Create empty data frame
nrow = (length(genes_union)*length(snp_union)),
ncol = col_length))
Error in matrix(NA, nrow = (length(genes_union) * length(snp_union)), :
invalid 'nrow' value (too large or NA)
In addition: Warning message:
In length(genes_union) * length(snp_union) :
NAs produced by integer overflow
I also tried to use big.matrix:
z <- big.matrix(,nrow=35886996700,ncol=49)
Error in big.matrix(, nrow = 35886996700, ncol = 49) :
Error: memory could not be allocated for instance of type big.matrix
Is there any way to solve this problem so that I can create a matrix with this many rows?
Basically, my final output matrix should look like the layout below.
G represents a gene, RS represents an ID, and T represents a tissue.
T1 T2 T3 ...Tn
G1RS1
G1RS2
G1RSn
G2RS1
G2RS2
G2RSN
GnRSn
I tried to generate a vector of 0's with length 35886996700 * 49:
x1 <- 35886996700
x1
[1] 3.5887e+10
x2 <- 49
vec1 <- rep(0, x1 * x2)
Error: cannot allocate vector of size 13101.6 Gb
I can't see any way to process or manage 13,101 GB of data. The big question is whether the matrix is extremely sparse; if it is, you may be able to store the data in a much more compact sparse format. If sparse storage is feasible, see the Matrix package (a recommended package that ships with R): https://www.rdocumentation.org/packages/Matrix/versions/1.5-3
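For reference, the 13,101 GB figure and the "integer overflow" warning can both be reproduced with plain arithmetic. This is a sketch only: the row and column counts are the ones from the question, while the two lengths of 50000 are made up to trigger the overflow, since the real lengths of genes_union and snp_union are not given.

```r
# Size of a dense double matrix with the dimensions from the question
n_rows <- 35886996700   # stored as a double, so no overflow here
n_cols <- 49
size_gib <- n_rows * n_cols * 8 / 2^30   # 8 bytes per double
round(size_gib, 1)

# length() returns an integer, and integer * integer overflows past 2^31 - 1
a <- 50000L   # hypothetical length(genes_union)
b <- 50000L   # hypothetical length(snp_union)
overflowed <- suppressWarnings(a * b)   # NA, normally with a warning
safe <- as.numeric(a) * b               # computed in double precision
```

Converting one operand with as.numeric() removes the warning, but as far as I know it does not rescue the original call: the dimensions of an ordinary R matrix are themselves stored as integers, so a row count of about 3.6e10 is rejected regardless of available memory.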

How to extract components from an object of class "spec"?

I am trying to construct a table of power spectra and run into this problem:
Define the table:
V <- tibble(month=double(),day=double(),hour=double(),minutes=double(),
frequency=double(),power=double(),period=double())
compute the spectrum:
S <- spec.pgram(Spec2d$Inst,spans=windowSize,log="yes")
which creates an object of class "spec"
I need to extract the data from S and put it into V. When I try:
V$frequency <- S$freq
I get this error message:
Error: Assigned data `S$freq` must be compatible with existing data.
x Existing data has 0 rows.
x Assigned data has 48 rows.
ℹ Only vectors of size 1 are recycled.
which doesn't make sense to me. I have tried to coerce S$freq into different types of objects, but nothing works.
S$freq is a vector of length 48 as in the error message
What is going on? Is there a workaround?
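The tibble V has 0 rows, and a tibble only recycles vectors of length 1, so a 48-element column cannot be assigned into it. Don't initialise the data frame/tibble first. Try: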
S <- spec.pgram(Spec2d$Inst,spans=windowSize,log="yes")
V <- data.frame(frequency = S$freq)
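A slightly fuller sketch of the same idea, building all spectrum columns in one step. The input series here is simulated noise standing in for Spec2d$Inst, and spans = 3 is an arbitrary stand-in for windowSize, so treat both as assumptions. The month, day, hour and minutes columns from the question are omitted because the post does not show where they would come from.

```r
set.seed(42)
x <- ts(rnorm(96))   # simulated stand-in for Spec2d$Inst

# plot = FALSE returns the "spec" object without drawing anything
S <- spec.pgram(x, spans = 3, plot = FALSE)

# S$freq and S$spec have the same length, so the table is built in one go
V <- data.frame(
  frequency = S$freq,
  power     = S$spec,
  period    = 1 / S$freq
)
head(V)
```

If a tibble is preferred, tibble::as_tibble(V) should convert the result afterwards.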

Loop through a character vector to use in a function

I am conducting a method comparison study, comparing measurements from two different systems. My dataset has a large number of columns, each containing measurements from one of the two systems.
aX and bX are both measures of X, but from systems a and b respectively. I have about 80 pairs of variables like this.
A simplified version of my data looks like this:
set.seed(1)
df <- data.frame(
ID = as.factor(rep(1:2, each=10)),
aX = rep(1:10+rnorm(10,mean=1,sd=0.5),2),
bX = rep(1:10+rnorm(10,mean=1,sd=0.5),2),
aY = rep(1:10+rnorm(10,mean=1,sd=0.5), 2),
bY = rep(1:10-rnorm(10,mean=1,sd=0.5),2))
head(df)
ID aX bX aY bY
1 1 1.686773 2.755891 2.459489 -0.6793398
2 1 3.091822 3.194922 3.391068 1.0513939
3 1 3.582186 3.689380 4.037282 1.8061642
4 1 5.797640 3.892650 4.005324 3.0269025
5 1 6.164754 6.562465 6.309913 4.6885298
6 1 6.589766 6.977533 6.971936 5.2074973
I am trying to loop through the elements of a character vector, and use the elements to point to columns in the dataframe. But I keep getting error messages when I try to call functions with variable names generated in the loop.
For simplicity, I have changed the loop to include a linear model as this produces the same type of error as I have in my original script.
#This line is only included to show that
#the formula used in the loop works when
#called with directly with the "real" column names
(broom::glance(lm(aX~bX, data = df)))$r.squared
[1] 0.9405218
#Now I try the loop
varlist <- c("X", "Y")
for(i in 1:length(varlist)){
aVAR <- paste0("a", varlist[i])
bVAR <- paste0("b", varlist[i])
#aVAR and bVAR appear to hold names identical to column names in the df dataframe
print(c(aVAR, bVAR))
#Try the formula with the loop variable names
print((broom::glance(lm(aVAR~bVAR, data = df)))$r.squared)
}
The error messages I get when calling the functions from inside the loop vary according to the function I am calling. The common denominator for all the errors is that they occur when I try to use the character vector (varlist) to pick out specific columns.
Example of error messages:
rmcorr(ID, aVAR, bVAR, df)
Error in rmcorr(ID, aVAR, bVAR, df) :
'Measure 1' and 'Measure 2' must be numeric
or
broom::glance(lm(aVAR~bVAR, data = df))
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
In storage.mode(v) <- "double" : NAs introduced by coercion
Can you help me understand what goes wrong in the loop? Or suggest and show another way to accomplish what I am trying to do?
Variables aren't evaluated in formulas (the things with ~).
You can type
bert ~ ernie
and not get an error even if variables named bert and ernie do not exist. Formulas store relationships between symbols/names and do not attempt to evaluate them. Also note we are not using quotes here. Variable names (or symbols) are not interchangeable with character values (i.e. aX is very different from "aX").
So when putting together a formula from string values, I suggest you use the reformulate() function. It takes a vector of names for the right-hand side and an optional value for the left-hand side. So you would create the same formula with
reformulate("ernie", "bert")
# bert ~ ernie
And you can use that with your lm:
lm(reformulate(bVAR, aVAR), data = df)
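Put together with the data from the question, the whole loop might look like the sketch below. It uses sapply() instead of for, and summary(lm(...))$r.squared instead of broom::glance() so that it runs with base R only; those two substitutions are my choice, not part of the question.

```r
set.seed(1)
df <- data.frame(
  ID = as.factor(rep(1:2, each = 10)),
  aX = rep(1:10 + rnorm(10, mean = 1, sd = 0.5), 2),
  bX = rep(1:10 + rnorm(10, mean = 1, sd = 0.5), 2),
  aY = rep(1:10 + rnorm(10, mean = 1, sd = 0.5), 2),
  bY = rep(1:10 - rnorm(10, mean = 1, sd = 0.5), 2)
)

varlist <- c("X", "Y")

# One R-squared per variable pair; reformulate() builds aX ~ bX, aY ~ bY
r2 <- sapply(varlist, function(v) {
  f <- reformulate(paste0("b", v), response = paste0("a", v))
  summary(lm(f, data = df))$r.squared
})
r2
```

The result is a named numeric vector with one entry per element of varlist, which scales to the 80 pairs without changing the loop body.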
I'm too lazy to search for a duplicate on how to construct formulas programmatically, so here is a solution:
varlist <- c("X", "Y")
for(i in 1:length(varlist)){
#make these symbols:
aVAR <- as.symbol(paste0("a", varlist[i]))
bVAR <- as.symbol(paste0("b", varlist[i]))
#aVAR and bVAR are now symbols matching column names in the df dataframe
print(c(aVAR, bVAR))
#Try the formula with the loop variable names
#construct the call to `lm` with `bquote` and `eval` the expression
print((broom::glance(eval(bquote(lm(.(aVAR) ~ .(bVAR), data = df)))))$r.squared)
}

Matrix assignment in R as a result of logical operation

With R, I want to generate a matrix where the epsilons are the columns and the rows are the input data. However, when I try to assign a value to the matrix, an error appears:
"Error in results[, j] <- (probabilities > epsilons[j]) :
replacement has length zero"
I have tried many ways, but I am stuck on this. Please note that this problem happens when Oracle R objects are in use. The short code below reproduces the problem:
library(ORE)
ore.connect(user="XXXX", service_name="XXXXX", host="XXXXXXXX", password="XXXXX", port=XXXX, all=TRUE)
ore.sync('MYDATABASE')
ore.attach()
ore.pull(MY_TABLE)
trainingset <- MY_TABLE$MY_COLUMN[1:1000]
crossvalidationset <- MY_TABLE$MY_COLUMN[1001:2000]
# Training
my_column_avg <- mean(trainingset)
my_column_std <- sd(trainingset)
# Validation
probabilities <- dnorm(crossvalidationset,my_column_avg,my_column_std)
epsilons <- c(0.01,0.05,0.1,0.25,0.5,0.75,0.8)
num_rows <- length(probabilities)
num_cols <- length(epsilons)
results <- matrix(TRUE, num_rows, num_cols)
# Anomaly detection results for several epsilons
for(j in 1:num_cols)
{
results[,j] <- (probabilities > epsilons[j])
}
MY_TABLE is an Oracle table object, not a data frame, and so is probabilities, since it was derived from MY_TABLE.
The error occurred when assigning that value into an ordinary R matrix, as in the line below:
results[,j] <- (probabilities > epsilons[j])
The error described above was caused by the use of the Oracle R library (ORE).
If ordinary R data structures are used from the beginning of the code above, this problem never happens, for instance by replacing the MY_TABLE Oracle object with a data frame.
Therefore it is good practice to convert Oracle R objects to R data frames whenever possible.
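For comparison, here is a sketch of the same anomaly-detection step using only ordinary R vectors. The data is simulated, since the Oracle table is not available, and the code assumes the column has already been brought into a local numeric vector. It also replaces the loop with outer(), which is my substitution rather than something from the question.

```r
set.seed(1)
# Simulated stand-ins for the two slices of MY_TABLE$MY_COLUMN
trainingset        <- rnorm(1000, mean = 10, sd = 2)
crossvalidationset <- rnorm(1000, mean = 10, sd = 2)

# Training
my_column_avg <- mean(trainingset)
my_column_std <- sd(trainingset)

# Validation
probabilities <- dnorm(crossvalidationset, my_column_avg, my_column_std)
epsilons <- c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.8)

# One row per observation, one column per epsilon, no explicit loop
results <- outer(probabilities, epsilons, FUN = ">")
colnames(results) <- epsilons
```

Because probabilities is a plain numeric vector here, each comparison has the same length as a matrix column, so the original for loop with results[, j] <- ... would work as well.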
