How do I optimize a nested for loop using data.table? - r

I am interested in optimizing some code using data.table. I feel I should be able to do better than my current solution, which does not scale well as the number of rows increases.
Suppose I have a matrix of values, with ID denoting a person and the remaining columns holding traits (lineage in my case). I want to create a logical matrix which reflects whether two IDs (rows) share any values between their rows (including the ID itself). I have been using data.table lately, but I cannot figure out how to do this more efficiently. I have tried (and failed) at nesting apply statements, and at somehow using data.table's .SD to accomplish this.
The working code is below.
library(data.table)

m <- matrix(rep(1:10, 2), nrow = 5, byrow = TRUE)
m[c(1, 3), 3:4] <- NA
dt <- data.table(m)
setnames(dt, c("id", "v1", "v2", "v3"))

res <- matrix(data = NA, nrow = 5, ncol = 5)
dimnames(res) <- list(dt[, id], dt[, id])
for (i in 1:nrow(dt)) {
  for (j in i:nrow(dt)) {
    # TRUE if rows i and j share any non-NA value (id included)
    res[j, i] <- res[i, j] <- length(na.omit(intersect(as.numeric(dt[i]), as.numeric(dt[j])))) > 0
  }
}
res

I had a similar problem a while ago and somebody helped me out. Here's that help converted to your problem...
tm <- t(m) # transpose the matrix
dtt <- data.table(tm[2:4, ]) # take the trait values into a data.table
setnames(dtt, as.character(tm[1, ])) # use the ids as column names
comblist <- combn(names(dtt), 2, FUN = list) # list of all possible column pairs
preresults <- dtt[, lapply(comblist, function(x) length(na.omit(intersect(as.numeric(get(x[1])), as.numeric(get(x[2]))))) > 0)] # recreate your double for loop
preresults <- melt(preresults, measure.vars = names(preresults)) # change columns to rows
preresults[, c("LHS", "RHS") := lapply(1:2, function(i) sapply(comblist, "[", i))] # add column labels
preresults[, variable := NULL] # drop the unneeded column
I'm drawing a blank on how to get my preresults into the same format as your res, but this should give you the performance boost you're looking for.
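A sketch of that last step (untested, and assuming the LHS/RHS labels match the ids used as dimnames above): index a symmetric matrix by the character ids and fill both triangles.
ids <- as.character(tm[1, ])
res2 <- matrix(TRUE, nrow = length(ids), ncol = length(ids),
               dimnames = list(ids, ids)) # diagonal is trivially TRUE
res2[cbind(preresults$LHS, preresults$RHS)] <- preresults$value
res2[cbind(preresults$RHS, preresults$LHS)] <- preresults$value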

Related

assign individual vector elements to individual columns by row and by name - r

I work mostly with data table but a data frame solution would work as well.
I have the result of an apply which returns this data structure
applyres <- structure(c(0.0260, 3.6775, 0.92), .Names = c("a.1", "a.2", "a.3"))
Then I have a data table
coltoadd=c('a.1','a.2','a.3')
dt <- data.table(variable1 = factor(c("what","when","where")))
dt[,coltoadd]=as.numeric(NA)
Now I would like to add the elements of applyres to the corresponding columns, just one row at a time, because applyres is calculated from another function. I have tried different assignments but nothing seems to work. Ideally I would like to assign based on column name, just in case the columns change order in one of the two structures.
This doesn't work
dt[1,coltoadd]=applyres
I also tried
dt[1,coltoadd := applyres]
And I tried changing applyres to a matrix or a data.table and transposing.
I would like to do something like this
dt[1,coltoadd[i]]=applyres[coltoadd[i]]
But I'm not sure it should go in a loop; that doesn't seem like the best way to do it.
How do I avoid doing single assignments if I have a large number of columns?
1) data.frame Convert to data.frame, perform the assignments and convert back.
DF <- as.data.frame(dt)
DF[1, -1] <- applyres
# perform remaining assignments
dt <- as.data.table(DF)
2) loop Another possibility is a for loop:
for(i in 2:ncol(dt)) dt[1, i] <- applyres[i-1]
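3) data.table For completeness, a sketch of data.table's own by-reference assignment (this assumes the names of applyres match coltoadd):
# := with a list on the right-hand side fills one column per element
dt[1, (coltoadd) := as.list(applyres[coltoadd])]
# or column by column with set(), which avoids [.data.table overhead inside loops
for (col in coltoadd) set(dt, i = 1L, j = col, value = applyres[[col]])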

Deviation from means in data table in R

I have a big data table called "dt", and I want to produce a data table of the same dimensions which gives the deviation from the row mean of each entry in dt.
This code works but it seems very slow to me. I hope there's a way to do it faster? Maybe I'm building my table wrong so I'm not taking advantage of the by-reference assignment. Or maybe this is as good as it gets?
(I'm an R novice, so any other tips are appreciated!)
Here is my code:
library(data.table)
r <- 100 # number of rows
c <- 100 # number of columns

# build a data table with random cols
# (maybe not the best way to build, but this isn't important)
dt <- data.table(rnorm(r))
for (i in c(1:(c-1))) {
  dt <- cbind(dt, rnorm(r))
}
colnames(dt) <- as.character(c(1:c))

devs <- copy(dt)
means <- rowMeans(dt)
for (i in c(1:nrow(devs))) {
  devs[i, colnames(devs) := abs(dt[i,] - means[[i]])]
}
If you subtract a vector from a data.frame (or data.table), that vector will be subtracted from every column of the data.frame (assuming they're all numeric). Numeric functions like abs also work on all-numeric data.frames. So, you can compute devs with
devs <- abs(dt - rowMeans(dt))
You don't need a loop to create dt either, you can use replicate, which replicates its second argument a number of times specified by the first argument, and arranges the results in a matrix (unless simplify = FALSE is given as an argument)
dt <- as.data.table(replicate(c, rnorm(r)))
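Putting the two pieces together, the whole example collapses to a few lines (a sketch reusing the r and c from your code):
library(data.table)
dt <- as.data.table(replicate(c, rnorm(r))) # c columns of r random draws
setnames(dt, as.character(1:c))
devs <- abs(dt - rowMeans(dt)) # deviation of each entry from its row mean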
Not sure if it's what you are looking for, but the sweep function helps you apply operations combining matrices and vectors (like your row means).
table <- matrix(rnorm(r*c), nrow=r, ncol=c) # generate random matrix
means <- apply(table, 1, mean) # compute row means
devs <- abs(sweep(table, 1, means, "-")) # compute by row the deviation from the row mean

How to apply operation and sum over columns in R?

I want to apply some operations to the values in a number of columns, and then sum the results of each row across columns. I can do this using:
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$a2 <- x$a^2
x$b2 <- x$b^2
x$result <- x$a2 + x$b2
but this will become arduous with many columns, and I'm wondering if anyone can suggest a simpler way. Note that the dataframe contains other columns that I do not want to include in the calculation (in this example, column sample is not to be included).
Many thanks!
I would simply subset the columns of interest and apply everything directly on the matrix using the rowSums function.
x <- data.frame(sample=1:3, a=4:6, b=7:9)
# put column indices and apply your function
x$result <- rowSums(x[,c(2,3)]^2)
This of course assumes your function is vectorized. If not, you would need to use some apply variant (of which you are seeing many here). That said, you can still use rowSums if you find it useful, like so. Note that I use sapply, which also returns a matrix.
# random custom function
myfun <- function(x){
  return(x^2 + 3)
}
rowSums(sapply(x[,c(2,3)], myfun))
I would suggest converting the data set into 'long' format, grouping it by sample, and then calculating the result. Here is the solution using data.table:
library(data.table)
melt(setDT(x),id.vars = 'sample')[,sum(value^2),by=sample]
# sample V1
#1: 1 65
#2: 2 89
#3: 3 117
You can easily replace value^2 by any function you want.
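For example, summing absolute values instead of squares (note that setDT is a no-op if x has already been converted):
melt(setDT(x), id.vars = 'sample')[, sum(abs(value)), by = sample]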
You can use the apply function and pick out the columns you need with c(i1, i2, ...):
apply(x[, c(2, 3)]^2, 1, sum)
If you want to apply a function named somefunction to some of the columns, whose indices or colnames are in the vector col_indices, and then sum the results, you can do:
# if somefunction can be vectorized:
x$results <- apply(x[, col_indices], 1, function(x) sum(somefunction(x)))
# if not:
x$results <- apply(x[, col_indices], 1, function(x) sum(sapply(x, somefunction)))
I want to come at this one from a "no extensions" R POV.
It's important to remember what kind of data structure you are working with. Data frames are actually lists of vectors: each column is itself a vector. So you can use the handy-dandy lapply function to apply a function to the desired columns in the list/data frame.
I'm going to define a function as the square, as you have above, but of course this can be any function of any complexity (so long as it takes a vector as input and returns a vector of the same length; if it doesn't, it won't fit into the original data.frame!).
The steps below are extra pedantic to show each little bit, but obviously it can be compressed into one or two steps. Note that I only retain the sums of the squared values (not the intermediate squared columns), given that you might want to save memory if you are working with lots and lots of data.
1) create data; define the function
2) grab the columns you want as a separate (temporary) data.frame
3) apply the function to the data.frame/list you just created
4) lapply returns a list, so if you intend to retain it separately, make it a temporary data.frame (this is not strictly necessary)
5) calculate the sums of the rows of the temporary data.frame and append it as a new column in x
6) remove the temp data.frame
Code:
x <- data.frame(sample=1:3, a=4:6, b=7:9); square <- function(x) x^2 # step 1
x[2:3] # step 2
temp <- data.frame(lapply(x[2:3], square)) # steps 3 and 4
x$squareRowSums <- rowSums(temp) # step 5
rm(temp) # step 6
Here is another apply solution:
cols <- c("a", "b")
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$result <- apply(x[, cols], 1, function(x) sum(x^2))

What's the most efficient way to partition and access dataframe rows in R?

I need to iterate through a dataframe, df, where
colnames(df) == c('year','month','a','id','dollars')
I need to iterate through all of the unique pairs ('a','id'), which I've found via
counts <- count(df, c('a','id')) # count() from plyr
uniquePairs <- counts[ counts$freq > 10, c('a','id') ]
Next I iterate through each of the unique pairs, finding the corresponding rows like so (I have named each column of uniquePairs appropriately):
aVec <- as.vector(uniquePairs$a)
idVec <- as.vector(uniquePairs$id)
for (i in 1:nrow(uniquePairs))
{
  a <- aVec[i]
  id <- idVec[i]
  selectRows <- (df$a==a & df$id==id)
  # ... get those rows and do stuff with them ...
  df <- df[!selectRows,] # so lookups are slightly faster next time through
  # ...
}
I know for loops are discouraged in general, but in this case I think one is appropriate. The loop itself seems irrelevant to this question, though maybe a more efficient approach would get rid of it entirely.
There are between 10-100k rows in the dataframe, and it makes sense that the relationship between lookup time and nrow(df) would be worse than linear (though I haven't tested it).
Now unique must have seen where each of these pairs occurred, even if it didn't save it. Is there a way to save that off, so that I have a boolean vector I could use for each of the pairs to more efficiently select them out of the dataframe? Or is there an alternate, better way to do this?
I have a feeling that some use of plyr or reshape could help me out, but I'm still relatively new to the large R ecosystem, so some guidance would be greatly appreciated.
data.table is your best option by far:
dt = data.table(df)
dt[, {do stuff in here, then leave results in list form}, by = list(a, id)]
for a simple case of an average of some variable:
dt[,list(Mean = mean(dollars)), by = list(a, id)]
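Since you only want pairs occurring more than 10 times, the counting step can be folded into the same grouped call with .N; groups whose j expression returns NULL are simply dropped from the result:
dt[, if (.N > 10) list(Mean = mean(dollars)), by = list(a, id)]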

R programming - Adding extra column to existing matrix

I am a beginner at R programming and am trying to add one extra column to a matrix having 50 columns. This new column would be the average of the first 10 values in that row.
randomMatrix <- generateMatrix(1,5000,100,50)
randomMatrix51 <- matrix(nrow=100, ncol=1)
for(ctr in 1:ncol(randomMatrix)){
randomMatrix51.mat[1,ctr] <- sum(randomMatrix [ctr, 1:10])/10
}
This gives the below error
Error in randomMatrix51.mat[1, ctr] <- sum(randomMatrix[ctr, 1:10])/10 :
  incorrect number of subscripts on matrix
I tried this
cbind(randomMatrix,sum(randomMatrix [ctr, 1:10])/10)
But it only works for one row; if I use this cbind in the loop, all the old values are overwritten.
How do I add the average of the first 10 values as the new column? Is there a better way to do this than looping over rows?
Bam!
a <- matrix(1:5000, nrow=100)
a <- cbind(a,apply(a[,1:10],1,mean))
On big datasets it is however faster (and arguably simpler) to use:
cbind(a, rowMeans(a[, 1:10]))
Methinks you are overthinking this.
a <- matrix(1:5000, nrow=100)
a <- transform(a, first10ave = rowMeans(a[, 1:10]))
