Replacing nested for loops with an apply function - r

I am building a movie recommendation engine, and the code below computes the similarity matrix.
data <- read.csv('movie_test.csv')
similarity <- matrix(NA, nrow(data), nrow(data))
for (i in 1:nrow(data)) {
  for (j in 1:nrow(data)) {
    if (i != j) {
      similarity[i, j] <- sum((data[i,] * data[j,]), na.rm = TRUE) /
        (sqrt((sum(((data[i,] - data[j,] + data[j,]) * data[i,]), na.rm = TRUE))) *
           sqrt((sum(((data[j,] - data[i,] + data[i,]) * data[j,]), na.rm = TRUE))))
    }
  }
}
For a small dataset this works perfectly, but for 900 users and 1000 movies it does not scale. I have heard that the apply family of functions is faster, but I doubt even that will scale. Is there any other way I can achieve this without using a for loop?
Thank you so much for your suggestions!!!

This should be fast:
m <- as.matrix(data)
m[is.na(m)] <- 0
z <- m %*% t(m)
d <- sqrt(diag(z))
similarity <- t(t(z) / d) / d
The diagonal will contain 1, which seems more appropriate than NA, but if you prefer you can always do:
diag(similarity) <- NA
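As a quick sanity check, the vectorised result matches the explicit cosine formula entry by entry. This is just a sketch: the small toy matrix below stands in for the ratings read from movie_test.csv, with NA marking unrated movies.
toy <- rbind(c(5, NA, 3, 1, NA, 2),
             c(4, 2, NA, 1, 5, NA),
             c(NA, 3, 4, NA, 2, 5))
m <- toy
m[is.na(m)] <- 0                 # treat missing ratings as 0
z <- m %*% t(m)                  # pairwise dot products
d <- sqrt(diag(z))               # row norms
sim <- t(t(z) / d) / d           # sim[i, j] = z[i, j] / (d[i] * d[j])
i <- 1; j <- 2
all.equal(sim[i, j],
          sum(m[i, ] * m[j, ]) / (sqrt(sum(m[i, ]^2)) * sqrt(sum(m[j, ]^2))))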

Related

How do I convolve() more than two distributions without doubling the result rowcount every time?

I am attempting to convolve() 36 beta distributions. The result distribution then becomes one of the two inputs to the next call to convolve(). After every convolve(), the result has row count = (nrow(vector1) + nrow(vector2) - 1). In effect, the row count of the result distribution almost doubles with every call, which makes the runtime impossibly long and consumes large amounts of memory. Is there any way to keep the row count constant?
Code example below ...
# Function from https://stat.ethz.ch/pipermail/r-help/2008-July/168762.html
weighted.var <- function(x, w, na.rm = FALSE) {
  if (na.rm) {
    w <- w[i <- !is.na(x)]
    x <- x[i]
  }
  sum.w <- sum(w)
  sum.w2 <- sum(w^2)
  mean.w <- sum(x * w) / sum(w)
  (sum.w / (sum.w^2 - sum.w2)) * sum(w * (x - mean.w)^2, na.rm = na.rm)
}
# Define beta distribution shape parameters.
s1a <- 3.52; s1b <- 65.35;
s2a <- 1.684; s2b <- 189.12;
s3a <- 5.696; s3b <- 32.34;
s4a <- 1.81; s4b <- 185.5;
# Define initial set of quantiles.
mQ1 <- matrix(data=seq(0,1,1/1000),ncol=1);
for (i in 1:3) {
  mPDF <- matrix(data=convolve(dbeta(mQ1,s1a,s1b),rev(dbeta(mQ1,s2a,s2b)),type="open"),ncol=1L);
  print(paste(nrow(mPDF),' rows',sep=''));
  if (i < 3) {
    # Calculate the merged shape parameters directly from mPDF.
    mQ2 <- matrix(data=seq(0,1L,(1L/(nrow(mPDF)-1L))),ncol=1L);
    wtMean <- weighted.mean(mQ2,mPDF);
    wtStd <- sqrt(weighted.var(mQ2,mPDF));
    s1a <- -1L * ((wtMean*(wtStd^2 + wtMean^2 - wtMean))/wtStd^2);
    s1b <- ((wtStd^2 + wtMean^2 - wtMean)*(wtMean - 1))/wtStd^2;
    s2a <- s3a; s2b <- s3b;
    mQ1 <- mQ2;
  }
} #i
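For what it is worth, the growth described above is inherent to convolve(..., type = "open"): the output length is always n1 + n2 - 1. One possible way to keep the row count constant, sketched here rather than taken from the original post, is to interpolate the convolved density back onto the original 1001-point grid after each call, for example with approx():
q  <- seq(0, 1, length.out = 1001)
d1 <- dbeta(q, 3.52, 65.35)
d2 <- dbeta(q, 1.684, 189.12)
conv <- convolve(d1, rev(d2), type = "open")
length(conv)                                   # 2 * 1001 - 1 = 2001
conv_q <- seq(0, 1, length.out = length(conv))
d_next <- approx(conv_q, conv, xout = q)$y     # back onto the 1001-point grid
length(d_next)                                 # 1001 again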

What is a better way to write this nested for loop in R?

I am writing a for loop to calculate a numerator which is part of a larger formula. I used a for loop, but it is taking a lot of time to compute. What would be a better way to do this?
city is a data frame with the following columns: pop, not.white, pct.not.white
n <- nrow(city)
numerator <- 0
for (i in 1:n) {
  ti <- city$pop[i]
  pi <- city$pct.not.white[i]
  for (j in 1:n) {
    tj <- city$pop[j]
    pj <- city$pct.not.white[j]
    numerator <- numerator + (ti * tj) * abs(pi - pj)
  }
}
Use the following toy data for result validation.
set.seed(0)
city <- data.frame(pop = runif(101), pct.not.white = runif(101))
The most obvious "vectorization":
# n <- nrow(city)
titj <- tcrossprod(city$pop)
pipj <- outer(city$pct.not.white, city$pct.not.white, "-")
numerator <- sum(titj * abs(pipj))
This will probably run into memory problems if n > 5000.
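For reference, validating the vectorised numerator against the original double loop on the toy data looks like this (a sketch):
set.seed(0)
city <- data.frame(pop = runif(101), pct.not.white = runif(101))
n <- nrow(city)
numerator_loop <- 0
for (i in 1:n) {
  for (j in 1:n) {
    numerator_loop <- numerator_loop +
      city$pop[i] * city$pop[j] * abs(city$pct.not.white[i] - city$pct.not.white[j])
  }
}
titj <- tcrossprod(city$pop)
pipj <- outer(city$pct.not.white, city$pct.not.white, "-")
all.equal(numerator_loop, sum(titj * abs(pipj)))   # should be TRUE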
A clever workaround (exploiting symmetry; a more memory-efficient "vectorization"):
## see https://stackoverflow.com/a/52086291/4891738 for function: tri_ind
n <- nrow(city)
ij <- tri_ind(n, lower = TRUE, diag = FALSE)
titj <- city$pop[ij$i] * city$pop[ij$j]
pipj <- abs(city$pct.not.white[ij$i] - city$pct.not.white[ij$j])
numerator <- 2 * crossprod(titj, pipj)[1]
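If the tri_ind() helper from that answer is not at hand, the same lower-triangle index pairs can be generated in base R. This is only a sketch: it builds an n-by-n logical mask, so it is less memory-frugal than a dedicated index generator.
n  <- nrow(city)
ij <- which(lower.tri(matrix(0L, n, n)), arr.ind = TRUE)   # columns "row" and "col"
titj <- city$pop[ij[, "row"]] * city$pop[ij[, "col"]]
pipj <- abs(city$pct.not.white[ij[, "row"]] - city$pct.not.white[ij[, "col"]])
numerator <- 2 * sum(titj * pipj)   # diagonal terms are zero, so doubling the lower triangle suffices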
The ultimate solution is to write a C / C++ loop, which I will not showcase here.

speed problems with odesolver in R

I have a differential equation model in R that uses the ode solver from the deSolve package. However, at the moment the model runs very slowly. I think this might have something to do with the function that I feed to the solver being poorly written, but I can't figure out what exactly is slowing it down or how I might speed it up. Does anyone have any ideas?
I've made an example that works in a similar way to mine:
library(data.table)
library(deSolve)
matrix_1 <- matrix(runif(100),10,10)
matrix_1[which(matrix_1 > 0.5)] <- 1
matrix_1[which(matrix_1 < 0.5)] <- 0
matrix_2 <- matrix(runif(100),10,10)
matrix_2[which(matrix_2 > 0.5)] <- 1
matrix_2[which(matrix_2 < 0.5)] <- 0
group_ID <- rep(c(1,2), 5)
N <- runif(10, 0, 100000)
Nchange <- function(t, N, parameters) {
  with(as.list(c(N, parameters)), {
    N_per_1 <- matrix_1 * N_per_connection
    N_per_1[is.na(N_per_1)] <- 0
    total_N_2 <- as.vector(N_per_1)
    if (nrow(as.matrix(N_per_1)) > 1) {
      total_N_2 <- colSums(N_per_1[drop = FALSE])
    }
    N_per_1_cost <- N_per_1
    for (i in possible_competition) {
      column <- as.vector(N_per_1[, i])
      if (sum(column) > 0) {
        active_groups <- unique(group_ID[column > 0])
        if (length(active_groups) > 1) {
          group_ID_dets <- data.table("group_ID" = group_ID, "column" = column, "n_IDS" = 1:length(group_ID))
          group_ID_dets$portions <- ave(group_ID_dets$column, group_ID_dets$group_ID, FUN = function(x) x / sum(x))
          group_ID_dets[is.na(group_ID_dets)] <- 0
          totals <- as.vector(unlist(tapply(group_ID_dets$column, group_ID_dets$group_ID, function(x) sum(x))))
          totals[is.na(totals)] <- 0
          totals <- totals*2 - sum(totals)
          totals[totals < 0] <- 0
          group_ID_totals <- data.table("group_ID" = unique(group_ID), "totals" = as.vector(totals))
          group_ID_dets$totals <- group_ID_totals$totals[match(group_ID_dets$group_ID, group_ID_totals$group_ID)]
          N_per_1[, i] <- group_ID_dets$totals * group_ID_dets$portions
        }
      }
    }
    res_per_1 <- N_per_1 * 0.1
    N_per_2 <- matrix_2 * N_per_connection
    N_per_2[is.na(N_per_2)] <- 0
    res_per_2 <- N_per_2 * 0.1
    dN <- rowSums(res_per_1) - rowSums(N_per_1_cost * 0.00003) + rowSums(res_per_2) -
      rowSums(N_per_2 * 0.00003) - N*0.03
    list(c(dN))
  })
} # function describing differential equations
N_per_connection <- N/(rowSums(matrix_1) + rowSums(matrix_2))
possible_competition <- which(colSums(matrix_1 != 0)>1)
times <- seq(0, 100, by = 1)
out <- ode(y = N, times = times, func = Nchange, parms = NULL)
A good way to identify the bottleneck is with a profiler, and the profvis package provides a good way of drilling down into the results. Wrapping your code in p <- profvis({YourCodeInHere}) and then viewing the results with print(p) gives the insights listed below; a minimal sketch of that setup follows.
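The sketch assumes the profvis package is installed and that N, times and Nchange from the question are already defined:
library(profvis)
p <- profvis({
  out <- ode(y = N, times = times, func = Nchange, parms = NULL)
})
print(p)   # opens the interactive profile with per-line timings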
The lines that are taking the most time are (in descending order of time taken):
group_ID_totals <- data.table("group_ID" = unique(group_ID), "totals" = as.vector(totals))
group_ID_dets$portions <- ave(group_ID_dets$column, group_ID_dets$group_ID, FUN = function(x) x / sum(x))
group_ID_dets <- data.table("group_ID" = group_ID, "column"= column, "n_IDS" = 1:length(group_ID))
totals <- as.vector(unlist(tapply(group_ID_dets$column, group_ID_dets$group_ID, function(x) sum(x))))
group_ID_dets$totals <- group_ID_totals$totals[match(group_ID_dets$group_ID, group_ID_totals$group_ID)]
I'm not familiar with the details of your ODE, but you should focus on optimising these tasks. I think the larger issue is that you're running these commands in a loop. Often, you'll hear that loops are slow in R, but a more nuanced discussion of this issue is found in the answers here. Some tips there might help you restructure your code/loop. Good luck!
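As one concrete example of that kind of restructuring (a sketch, not part of the original answer, and untested against the full model): the per-column group sums and within-group shares that dominate the profile can be computed without building a data.table on every iteration, for instance with rowsum():
grp_sum  <- rowsum(column, group_ID)                       # one row per group, in sorted group order
portions <- column / grp_sum[as.character(group_ID), 1]    # each entry divided by its group's sum
portions[!is.finite(portions)] <- 0                        # guard against groups whose sum is zero
totals   <- as.vector(grp_sum)                             # same values the tapply() call was producing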

Parallelize an R Script

My R script takes too much time to run, and the main solution I am considering is to parallelize it. I don't know where to start.
My code looks like this:
n <- nrow(aa)
output <- matrix(0, n, n)
akl <- function(dii) {
  ddi <- as.matrix(dii)
  m <- rowMeans(ddi)
  M <- mean(ddi)
  r <- sweep(ddi, 1, m)
  b <- sweep(r, 2, m)
  return(b + M)
}
for (i in 1:n) {
  A <- akl(dist(aa[i, ]))
  dVarX <- sqrt(mean(A * A))
  for (j in i:n) {
    B <- akl(dist(aa[j, ]))
    V <- sqrt(dVarX * (sqrt(mean(B * B))))
    output[i, j] <- (sqrt(mean(A * B))) / V
  }
}
I would like to parallelize it across different CPUs. How can I do that?
I saw the SNOW package; is it suitable for my purpose?
Thank you for your suggestions,
Gab
There are two ways I can think of in which your code could be made to run faster:
First: As @Dwin was saying (with a small twist), you could precompute akl (yes, not necessarily dist, but the whole of akl).
# a random square matrix
aa <- matrix(runif(100), ncol=10)
n <- nrow(aa)
output <- matrix (0, n, n)
akl <- function(dii) {
  ddi <- as.matrix(dii)
  m <- rowMeans(ddi)
  M <- mean(m) # mean(ddi) == mean(m)
  r <- sweep(ddi, 1, m)
  b <- sweep(r, 2, m)
  return(b + M)
}
# precompute akl here
require(plyr)
akl.list <- llply(1:nrow(aa), function(i) {
  akl(dist(aa[i, ]))
})
# Now, apply your function, but index the list instead of computing every time
for (i in 1:n) {
  A <- akl.list[[i]]
  dVarX <- sqrt(mean(A * A))
  for (j in i:n) {
    B <- akl.list[[j]]
    V <- sqrt(dVarX * (sqrt(mean(B * B))))
    output[i, j] <- (sqrt(mean(A * B))) / V
  }
}
This should already get your code to run faster than before on larger matrices, since you were computing akl every time in the inner loop.
Second: In addition to that, you can make it faster by parallelising as follows:
# now, the parallelisation you require can be achieved as follows
# with the help of `plyr` and `doMC`.
# First step of parallelisation is to compute akl in parallel
require(plyr)
require(doMC)
registerDoMC(10) # 10 Cores/CPUs
akl.list <- llply(1:nrow(aa), function(i) {
  akl(dist(aa[i, ]))
}, .parallel = TRUE)
# then, you could write your for-loop using plyr again as follows
output <- laply(1:n, function(i) {
  A <- akl.list[[i]]
  dVarX <- sqrt(mean(A * A))
  t <- laply(i:n, function(j) {
    B <- akl.list[[j]]
    V <- sqrt(dVarX * (sqrt(mean(B*B))))
    sqrt(mean(A * B))/V
  })
  c(rep(0, n-length(t)), t)
}, .parallel = TRUE)
Note that I have added .parallel = TRUE only on the outer loop. This is because you assign 10 cores to the outer loop. If you were to add it to both the outer and inner loops, the total number of worker processes would be 10 * 10 = 100. Please take care of this.
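One small practical note, not from the original answer: rather than hard-coding 10 workers, the core count can be read from the machine. Keep in mind that doMC relies on forking, so it works on Linux and macOS but not on Windows.
library(parallel)
library(doMC)
n_workers <- max(1, detectCores() - 1, na.rm = TRUE)   # detectCores() can return NA on some systems
registerDoMC(n_workers)                                # leave one core free for the rest of the system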

R: creating a matrix with unknown number of rows

I have written the code below to generate a matrix containing what is, to me, a fairly complex pattern. In this case I determined that there are 136 rows in the finished matrix by trial and error.
I could write a function to calculate the number of matrix rows in advance, but the function would be a little complex. In this example the number of rows in the matrix = ((4 * 3 + 1) + (3 * 3 + 1) + (2 * 3 + 1) + (1 * 3 + 1)) * 4.
Is there an easy and efficient way to create matrices in R without hard-wiring the number of rows in the matrix statement? In other words, is there an easy way to let R simply add a row to a matrix as needed when using for-loops?
I have presented one solution that employs rbind at each pass through the loops, but that seems a little convoluted and I was wondering if there might be a much easier solution.
Sorry if this question is redundant with an earlier question. I could not locate a similar question using the search feature on this site or using an internet search engine today, although I think I have found a similar question somewhere in the past.
Below are 2 sets of example code, one using rbind and the other where I used trial and error to set nrow=136 in advance.
Thanks for any suggestions.
v1 <- 5
v2 <- 2
v3 <- 2
v4 <- (v1-1)
my.matrix <- matrix(0, nrow=136, ncol=(v1+4) )
i = 1
for (a in 1:v2) {
  for (b in 1:v3) {
    for (c in 1:v4) {
      for (d in (c+1):v1) {
        if (d == (c+1)) l.s = 4
        else l.s = 3
        for (e in 1:l.s) {
          my.matrix[i, c] = 1
          if (d == (c+1)) my.matrix[i, d] = (e-1)
          else my.matrix[i, d] = e
          my.matrix[i, (v1+1)] = a
          my.matrix[i, (v1+2)] = b
          my.matrix[i, (v1+3)] = c
          my.matrix[i, (v1+4)] = d
          i <- i + 1
        }
      }
    }
  }
}
my.matrix2 <- matrix(0, nrow=1, ncol=(v1+4) )
my.matrix3 <- matrix(0, nrow=1, ncol=(v1+4) )
i = 1
for (a in 1:v2) {
  for (b in 1:v3) {
    for (c in 1:v4) {
      for (d in (c+1):v1) {
        if (d == (c+1)) l.s = 4
        else l.s = 3
        for (e in 1:l.s) {
          my.matrix2[1, c] = 1
          if (d == (c+1)) my.matrix2[1, d] = (e-1)
          else my.matrix2[1, d] = e
          my.matrix2[1, (v1+1)] = a
          my.matrix2[1, (v1+2)] = b
          my.matrix2[1, (v1+3)] = c
          my.matrix2[1, (v1+4)] = d
          i <- i+1
          if (i == 2) my.matrix3 <- my.matrix2
          else my.matrix3 <- rbind(my.matrix3, my.matrix2)
          my.matrix2 <- matrix(0, nrow=1, ncol=(v1+4))
        }
      }
    }
  }
}
all.equal(my.matrix, my.matrix3)
If you have some upper bound on the size of the matrix, you can create a matrix large enough to hold all the data
my.matrix <- matrix(0, nrow=v1*v2*v3*v4*4, ncol=(v1+4) )
and truncate it at the end.
my.matrix <- my.matrix[1:(i-1),]
This is the generic form; you can adapt it to your problem:
result <- NULL
for (...) {
  ...
  result <- rbind(result, new_row)
}
where new_row contains the row elements. rbind() ignores NULL arguments, so the first pass through the loop needs no special case.
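A related pattern, sketched here as a supplement rather than as part of the original answer, is to collect the rows in a list and bind them once at the end, which avoids copying a growing matrix on every pass (shown with the small 0:2 example used further below):
rows <- list()
k <- 1
for (i1 in 0:2) {
  for (i2 in 0:2) {
    for (i3 in 0:2) {
      rows[[k]] <- c(i1, i2, i3)      # build each row as a plain vector
      k <- k + 1
    }
  }
}
result <- do.call(rbind, rows)        # a single rbind at the end
dim(result)                           # 27 x 3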
I stumbled upon this solution today: convert the matrix to a data.frame. As new rows are needed by the for-loop those rows are automatically added to the data.frame. Then you can convert the data.frame back to a matrix at the end if you want. I am not sure whether this constitutes something similar to iterative use of rbind. Perhaps it becomes very slow with large data.frames. I do not know.
my.data <- matrix(0, ncol = 3, nrow = 2)
my.data <- as.data.frame(my.data)
j <- 1
for (i1 in 0:2) {
  for (i2 in 0:2) {
    for (i3 in 0:2) {
      my.data[j, 1] <- i1
      my.data[j, 2] <- i2
      my.data[j, 3] <- i3
      j <- j + 1
    }
  }
}
my.data
my.data <- as.matrix(my.data)
dim(my.data)
class(my.data)
EDIT: July 27, 2015
You can also delete the first matrix statement, create an empty data.frame, and then convert the data.frame to a matrix at the end:
my.data <- data.frame(NULL,NULL,NULL)
j <- 1
for (i1 in 0:2) {
  for (i2 in 0:2) {
    for (i3 in 0:2) {
      my.data[j, 1] <- i1
      my.data[j, 2] <- i2
      my.data[j, 3] <- i3
      j <- j + 1
    }
  }
}
my.data
my.data <- as.matrix(my.data)
dim(my.data)
class(my.data)
