Counting algorithm for big data in R

I have a big data frame with almost 1M rows (transactions) and 2600 columns (items). The values in the data set are 1's and NA's, and every column is stored as a factor. I want to add a new column to the end of the data frame that holds the number of 1's in each row.
Here is the R code that I wrote:
for(i in 1:nrow(dataset)){
  counter<-0
  for(j in 1:ncol(dataset)){
    if(!is.na(dataset[i,j])){
      counter<- counter+1
    }
  }
  dataset[i,ncol(dataset)+1]<-counter
}
But it has been running in RStudio for a very long time, because the nested loops visit every cell one by one, so the running time is O(rows × columns). Is there another way to do this, or a way to improve this algorithm? (The machine has 80 GB of memory.)

Using a matrix (of numbers, not factors), as @joran suggested, would be better for this; then simply do:
rowSums(your_matrix, na.rm = TRUE)
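A minimal sketch of that conversion, assuming a toy data frame of factor columns that hold only 1 and NA. The toy data and the names `num_matrix`, `counts`, and `counts_alt` are made up for illustration:

```r
# Toy stand-in for the real data: factor columns containing only 1 and NA
dataset <- data.frame(
  item_a = factor(c(1, NA, 1)),
  item_b = factor(c(NA, NA, 1)),
  item_c = factor(c(1, 1, 1))
)

# Convert each factor column through its labels to get a numeric matrix
num_matrix <- sapply(dataset, function(col) as.numeric(as.character(col)))

# Row-wise count of 1's, ignoring NA
counts <- rowSums(num_matrix, na.rm = TRUE)

# Since the only non-NA value is 1, counting non-missing cells gives the same result
counts_alt <- rowSums(!is.na(dataset))
```

The second form skips the factor-to-number conversion entirely, which may matter at 1M × 2600, but it is only valid while 1 is the sole non-missing value.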

eddi's answer is the best fit for your case. A more general solution is to vectorize the code, meaning it operates on all rows at once:
counter <- rep(0, nrow(dataset))
for(j in 1:ncol(dataset)) {
counter <- counter + !is.na(dataset[[j]])
}
dataset$no_of_1s <- counter
One note: in your code, the line
dataset[i,ncol(dataset)+1]<-counter
creates a new column for each row (because ncol(dataset) grows by one at every step), so the final data.frame would have 1M rows and about 1M columns, which won't fit in your memory.
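The column growth can be seen on a small example. This is a sketch on made-up data (3 rows, 2 columns); `cols_before` and `cols_after` are illustrative names:

```r
# Toy data: 3 rows, 2 columns of 1 / NA
dataset <- data.frame(a = c(1, NA, 1), b = c(NA, 1, 1))
cols_before <- ncol(dataset)

for (i in 1:nrow(dataset)) {
  counter <- sum(!is.na(dataset[i, ]))
  # ncol(dataset) is re-evaluated on every pass, so each row targets a brand-new column
  dataset[i, ncol(dataset) + 1] <- counter
}

cols_after <- ncol(dataset)  # one extra column per processed row
```

Computing the target column position once, before the loop, avoids the growth, although the vectorized versions remain the better choice.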
Another option is to use Reduce
dataset$no_of_1s <- Reduce(function(a,b) a+!is.na(b), dataset, init=integer(nrow(dataset)))
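As a quick sanity check, the column-wise loop and the Reduce version can be compared on toy data. This is only a sketch; the data frame and the names `counter` and `folded` are invented for the example:

```r
dataset <- data.frame(a = c(1, NA, 1), b = c(NA, NA, 1), c = c(1, 1, 1))

# Column-by-column loop: one vectorized addition per column
counter <- integer(nrow(dataset))
for (j in seq_len(ncol(dataset))) {
  counter <- counter + !is.na(dataset[[j]])
}

# The same accumulation written as a fold over the columns
folded <- Reduce(function(acc, col) acc + !is.na(col), dataset,
                 integer(nrow(dataset)))
```

Both forms do one pass per column (2600 passes in the question) instead of one step per cell.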

Related

Vectorizing a column-by-column comparison to separate values

I'm working with data gathered from multi-channel electrode systems, and am trying to make this run faster than it currently is, but I can't find any good way of doing it without loops.
The gist of it is this: I have modified averages for each column (each column is a channel), and need to compare each value in a column to the average for that column. If the value is above the adjusted mean, then I need to put that value in another data frame so it can be easily read.
Here is some sample code for the problematic bit:
readout <- data.frame(dimnmames <- c("Values"))
#need to clear the dataframe in order to run it multiple times without errors
#timeFrame is just a subsection of the original data, 60 channels with upwards of a few million rows
readout <- readout[0,]
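spikes <- c() #start from an empty vector so repeated runs do not accumulate
for (i in 1:ncol(timeFrame)){
  for (g in 1:nrow(timeFrame)){
    if (timeFrame[g,i] >= posCompValues[i,1])
      spikes <- append(spikes, timeFrame[g,i]) #append() returns a new vector; assign it back
  }
}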
The data ranges from 500 thousand to upwards of 130 million readings, so if anyone could point me in the right direction I'd appreciate it.
Something like this should work:
Return values of x greater than y:
cmpfun <- function(x,y) return(x[x>y])
For each element (column) of timeFrame, compare with the corresponding value of the first column of posCompValues
vals1 <- Map(cmpfun,timeFrame,posCompValues[,1])
Collapse the list into a single vector:
spikes <- unlist(vals1)
If you want to save both the value and the corresponding column it may be worth unpacking this a bit into a for loop:
resList <- list()
for (i in seq(ncol(timeFrame))) {
tt <- timeFrame[,i]
spikes <- tt[tt>posCompValues[i,1]]
if (length(spikes)>0) {
resList[[i]] <- data.frame(value=spikes,orig_col=i)
}
}
res <- do.call(rbind, resList)
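A small worked example of the Map approach, assuming toy inputs: two channels and one threshold per channel. The data, the column name `threshold`, and the variable `above` are made up here:

```r
# Toy data: two channels, three readings each
timeFrame <- data.frame(ch1 = c(1, 5, 3), ch2 = c(10, 2, 8))
# One threshold per channel, stored in the first column
posCompValues <- data.frame(threshold = c(2, 9))

# Compare each channel against its own threshold in one vectorized step
above <- Map(function(x, y) x[x > y], timeFrame, posCompValues[, 1])

# Flatten to a vector, and rebuild the originating column from the list lengths
spikes <- unlist(above, use.names = FALSE)
res <- data.frame(value = spikes,
                  orig_col = rep(seq_along(above), lengths(above)))
```

Building `orig_col` from `lengths()` avoids the per-column `data.frame()` and `rbind` calls, which could help when there are many channels.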

Shifting rows of matrices depending on some other data

I'm sorry for repeating a question about the *apply functions, but I cannot get my code to work with the material that I found so far. I have a matrix (stored in a large data frame) and I want to shift the rows of this matrix by a certain amount (to the left). The amount by which I want to shift is different for each row and is stored in another column of the same data frame. The following code should illustrate what I am aiming for
mat <- matrix(rnorm(15),ncol=5,nrow=3);
sv <- c(1,4,2);
mat;
shift <- function(x,shift){c(x[(1+max(0,shift)):length(x)],rep(0,max(0,shift)))}
for(i in 1:nrow(mat)){mat[i,] <- shift(mat[i,],sv[i])}
mat;
But this runs incredibly slowly on my 300000x201 matrix, so how could I vectorize this (using some of the *apply functions)?
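Working on larger chunks will speed things up: process all rows that share the same shift in a single assignment.
n.col <- ncol(mat)
for(i in unique(sv)){
  if(i <= 0) next # nothing to shift; assumes 0 < i < n.col otherwise
  selection <- which(sv == i)
  # move columns (i+1)..n.col to the front, then zero the last i columns
  mat[selection, 1:(n.col - i)] <- mat[selection, (i + 1):n.col]
  mat[selection, (n.col - i + 1):n.col] <- 0
}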
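The chunked idea can be checked against the row-by-row version on a small matrix. This is a sketch that assumes every shift satisfies 0 < shift < ncol(mat); `shift_row`, `slow`, and `fast` are illustrative names:

```r
mat <- matrix(as.numeric(1:15), nrow = 3, ncol = 5)
sv <- c(1, 4, 2)

# Reference: shift one row to the left by s positions and pad with zeros
shift_row <- function(x, s) c(x[(1 + s):length(x)], rep(0, s))
slow <- mat
for (r in seq_len(nrow(slow))) slow[r, ] <- shift_row(slow[r, ], sv[r])

# Chunked: handle all rows sharing the same shift in one assignment
fast <- mat
n.col <- ncol(fast)
for (s in unique(sv)) {
  rows <- which(sv == s)
  fast[rows, 1:(n.col - s)] <- fast[rows, (s + 1):n.col]  # move the kept columns left
  fast[rows, (n.col - s + 1):n.col] <- 0                  # zero the vacated tail
}
```

The loop now runs once per distinct shift value rather than once per row, which is at most 201 iterations for a 201-column matrix.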

How do I optimize a nested for loop using data.table?

I am interested in optimizing some code using data.table. I feel I should be able to do better than my current solution, which does not scale well as the number of rows increases.
Consider a matrix of values, with ID denoting a person and the remaining values being traits (lineage in my case). I want to create a logical matrix that reflects whether two IDs (rows) share any values across their rows (including ID). I have been using data.table lately, but I cannot figure out how to do this more efficiently. I have tried (and failed) at nesting apply statements, or somehow using the .SD function of data.table to accomplish this.
The working code is below.
library(data.table)
m <- matrix(rep(1:10,2),nrow=5,byrow=TRUE)
m[c(1,3),3:4] <- NA
dt <- data.table(m)
setnames(dt,c("id","v1","v2","v3"))
res <- matrix(data=NA,nrow=5,ncol=5)
dimnames(res) <- list(dt[,id],dt[,id])
for (i in 1:nrow(dt)){
for (j in i:nrow(dt)){
res[j,i] <- res[i,j] <-length(na.omit(intersect(as.numeric(dt[i]),as.numeric(dt[j])))) > 0
}
}
res
I had a similar problem a while ago and somebody helped me out. Here's that help converted to your problem...
tm<-t(m) #transpose the matrix
dtt<-data.table(tm[2:4,]) #take values of matrix into data.table
setnames(dtt,as.character(tm[1,])) #make data.table column names
comblist<-combn(names(dtt),2,FUN=list) #create list of all possible column combinations
preresults<-dtt[,lapply(comblist, function(x) length(na.omit(intersect(as.numeric(get(x[1])),as.numeric(get(x[2]))))) > 0)] #recreate your double for loop
preresults<-melt(preresults,measure.vars=names(preresults)) #change columns to rows
preresults[,c("LHS","RHS"):=lapply(1:2,function(i)sapply(comblist,"[",i))] #add column labels
preresults[,variable:=NULL] #kill unneeded column
I'm drawing a blank on how to get my preresults to be in the same format as your res but this should give you the performance boost you're looking for.
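One possible way to get from per-pair results back to a square matrix like res is matrix indexing. The sketch below uses base R rather than data.table, so it illustrates only the reshaping step and makes no claim about speed; `pairs` and `shares_value` are illustrative names:

```r
m <- matrix(rep(1:10, 2), nrow = 5, byrow = TRUE)
m[c(1, 3), 3:4] <- NA

n <- nrow(m)
pairs <- combn(n, 2)  # each column holds one (i, j) pair with i < j

# One logical per pair: do the two rows share any non-NA value?
shares_value <- apply(pairs, 2, function(p) {
  length(na.omit(intersect(m[p[1], ], m[p[2], ]))) > 0
})

ids <- as.character(m[, 1])
res <- matrix(FALSE, n, n, dimnames = list(ids, ids))
diag(res) <- TRUE                     # assumes every id is non-NA
res[t(pairs)] <- shares_value         # fill the upper triangle
res[t(pairs)[, 2:1]] <- shares_value  # mirror into the lower triangle
```

Indexing with a two-column matrix assigns one cell per row of the index, so the pair results land directly in their (i, j) and (j, i) positions.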

R programming - Adding extra column to existing matrix

I am a beginner to R programming and am trying to add one extra column to a matrix having 50 columns. This new column would be the avg of first 10 values in that row.
randomMatrix <- generateMatrix(1,5000,100,50)
randomMatrix51 <- matrix(nrow=100, ncol=1)
for(ctr in 1:ncol(randomMatrix)){
randomMatrix51.mat[1,ctr] <- sum(randomMatrix [ctr, 1:10])/10
}
This gives the below error
Error in randomMatrix51.mat[1, ctr] <- sum(randomMatrix[ctr, 1:10])/10 : incorrect number of subscripts on matrix
I tried this
cbind(randomMatrix,sum(randomMatrix [ctr, 1:10])/10)
But it only works for one row; if I use this cbind in the loop, all the old values are overwritten.
How do I add the average of the first 10 values as the new column? Is there a better way to do this than looping over rows?
Bam!
a <- matrix(1:5000, nrow=100)
a <- cbind(a,apply(a[,1:10],1,mean))
On big datasets it is however faster (and arguably simpler) to use:
cbind(a, rowMeans(a[,1:10]) )
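A short check that the two forms agree, using the same example matrix. The names `via_apply`, `via_rowMeans`, and `b` are illustrative:

```r
a <- matrix(1:5000, nrow = 100)

# Per-row mean of the first 10 columns, computed two ways
via_apply <- apply(a[, 1:10], 1, mean)
via_rowMeans <- rowMeans(a[, 1:10])

# Append the result as column 51
b <- cbind(a, via_rowMeans)
```

Row 1 holds 1, 101, ..., 901 in its first ten columns, so its appended mean is 451.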
Methinks you are overthinking this.
a <- matrix(1:5000, nrow=100)
a <- transform(a, first10ave = rowMeans(a[,1:10])) # note: transform() returns a data frame, not a matrix

Return value from column indicated in same row

I'm stuck with a simple loop that takes more than an hour to run, and need help to speed it up.
Basically, I have a matrix with 31 columns and 400 000 rows. The first 30 columns have values, and the 31st column has a column-number. I need to, per row, retrieve the value in the column indicated by the 31st column.
Example row: [26,354,72,5987..,461,3] (this means that the value in column 3 is sought after (72))
The too slow loop looks like this:
a <- rep(0,nrow(data)) #To pre-allocate memory
for (i in 1:nrow(data)) {
a[i] <- data[i,data[i,31]]
}
I would think this would work:
a <- data[,data[,31]]
... but it results in "Error: cannot allocate vector of size 2.8 Mb".
I fear that this is a really simple question, so I've spent hours trying to understand apply, lapply, reshape, and more, but somehow I can't get a grip on the vectorization concept in R.
The matrix actually has even more columns that also go into the a-parameter, which is why I don't want to rebuild the matrix, or split it.
Your support is highly appreciated!
Chris
t(data[,1:30])[30*(0:399999)+data[,31]]
This works because you can reference matrices both in array format and in vector format (a 400000*30 long vector for the transposed values here), counting column-wise first. To count row-wise, you use the transpose.
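A tiny illustration of the transpose trick, assuming a made-up matrix with 3 value columns plus an index column instead of 30 plus one. `n_val`, `a_loop`, and `a_vec` are illustrative names:

```r
# Two rows; the last column says which value column to pick
data <- cbind(matrix(c(26, 354, 72,
                       11,  12, 13), nrow = 2, byrow = TRUE),
              c(3, 1))
n_val <- ncol(data) - 1

# Reference: the row-by-row loop from the question
a_loop <- numeric(nrow(data))
for (i in seq_len(nrow(data))) a_loop[i] <- data[i, data[i, ncol(data)]]

# Vectorized: after transposing, row r occupies positions n_val*(r-1) + 1..n_val
a_vec <- t(data[, 1:n_val])[n_val * (seq_len(nrow(data)) - 1) + data[, ncol(data)]]
```

Here the first row picks column 3 (72) and the second row picks column 1 (11).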
Single-index notation for the matrix may use less memory. This would involve doing something like:
i <- nrow(data)*(data[,31]-1) + 1:nrow(data)
a <- data[i]
Below is an example of single-index notation for matrices in R. In this example, the index of the per-row maximum is appended as the last column of a random matrix. This last column is then used to select the per-row maxima via single-index notation.
## create a random (10 x 5) matrix
M <- matrix(rpois(50,50),10,5)
## use the last column to index the maximum value of the first 5
## columns
MM <- cbind(M,apply(M,1,which.max))
## linear index = nrow * (column ID - 1) + row ID
i <- nrow(MM)*(MM[,ncol(MM)]-1) + 1:nrow(MM)
all(MM[i] == apply(M,1,max))
Using an index matrix is an alternative that will probably use more memory but is slightly clearer:
ii <- cbind(1:nrow(MM),MM[,ncol(MM)])
all(MM[ii] == apply(M,1,max))
Try to change the code to work a column at a time:
M <- matrix(rpois(30*400000,50),400000,30)
MM <- cbind(M,apply(M,1,which.max))
a <- rep(0,nrow(MM))
for (i in 1:(ncol(MM)-1)) {
a[MM[, ncol(MM)] == i] <- MM[MM[, ncol(MM)] == i, i]
}
For each i, this fills the elements of a whose last-column value is i with the values from column i. It took longer to build the matrix than to calculate vector a.
