R: Find optimal sub-matrix

I have a large symmetric matrix and want to reduce it to a much smaller matrix matrix_small with rows(matrix_small) = n. The mean of matrix_small should be maximized. Is there a way to achieve this goal in R with a better algorithm than the one I already have? Better means either faster with the same mean, or a higher mean with the same speed.
I feel like there should be a smarter way than searching for the minimum so often, but I'm not aware of a way to set an SQL-like index on a matrix in R to improve performance.
library(microbenchmark)
set.seed(2016)
sym_matrix <- matrix(runif(1e+06), ncol = 1000)
sym_matrix[lower.tri(sym_matrix)] <- t(sym_matrix)[lower.tri(sym_matrix)]
diag(sym_matrix) <- NA
rownames(sym_matrix) <- 1:1000
colnames(sym_matrix) <- 1:1000
findNrows <- function(sym_matrix, nrows){
  # Return a matrix with rows(matrix) = nrows.
  # mean(matrix) should be maximized
  set.seed(2017)
  k <- nrow(sym_matrix)
  for (i in nrows:(k-1)) { # eliminate rows with minimum values
    min_rows <- arrayInd(which.min(sym_matrix), dim(sym_matrix))
    choose_row <- sample(min_rows, 1)
    sym_matrix <- sym_matrix[-choose_row, -choose_row]
  }
  sym_matrix
}
microbenchmark(findNrows(sym_matrix = sym_matrix, nrows = 10), times = 25L)
mean(findNrows(sym_matrix = sym_matrix, nrows = 10), na.rm = TRUE)

The problem is one of finding the optimal nrows rows (and the corresponding columns) of a symmetric matrix that maximize the sum of the elements in the selected sub-matrix. Unlike the so-called maximum subarray problem in 2D, which can be solved with Kadane's algorithm, the key issue here is that the chosen rows need not be contiguous. As a result, this appears to be a much harder combinatorial optimization. A brute-force search over all combinations of nrows rows (here 10) out of N rows (here 1000) is clearly impractical. However, a very simple approach, different from the OP's algorithm, is to do a random search in the space of all combinations: at each trial we randomly select nrows rows (and the corresponding columns) from the symmetric matrix and keep the best set of nrows rows seen across trials:
findNrows.random <- function(sym_matrix, nrows, ntrials){
  set.seed(2017)
  s.rows <- sample.int(nrow(sym_matrix), nrows)
  s <- sym_matrix[s.rows, s.rows]
  for (i in 1:ntrials) {
    t.rows <- sample.int(nrow(sym_matrix), nrows)
    t <- sym_matrix[t.rows, t.rows]
    if (sum(s, na.rm = TRUE) < sum(t, na.rm = TRUE)) {
      s.rows <- t.rows
      s <- t
    }
  }
  return(s)
}
This algorithm, implemented in R, is fast even for a large number of trials, and for just 1000 trials it produces a result (for this particular data set and seed) that is surprisingly on par with the OP's result while being roughly 500 times faster. This speaks more to the sub-optimality of the OP's algorithm than to the optimality of random search, because 1000 samples is a tiny fraction of the overall search space. In addition, by construction, the performance in terms of the mean value of the selected sub-matrix is guaranteed not to decrease as the number of trials increases. Therefore, for the same compute time, simple random search will outperform the OP's algorithm.
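For scale, the size of the full search space can be computed directly:
# number of ways to choose 10 rows out of 1000
choose(1000, 10)  # roughly 2.6e+23 candidate row sets, far too many to enumerate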
## OP results
microbenchmark(findNrows(sym_matrix = sym_matrix, nrows = 10), times = 25L)
##Unit: seconds
## expr min lq mean median uq max neval
## findNrows(sym_matrix = sym_matrix, nrows = 10) 11.67548 11.69193 11.70937 11.6997 11.71076 11.87105 25
mean(findNrows(sym_matrix = sym_matrix, nrows = 10), na.rm = TRUE)
##[1] 0.6256406
## Random search
microbenchmark(findNrows.random(sym_matrix = sym_matrix, nrows = 10, ntrials=1000), times = 25L)
##Unit: milliseconds
## expr min lq mean median uq max neval
## findNrows.random(sym_matrix = sym_matrix, nrows = 10, ntrials = 1000) 21.81462 23.20069 27.06079 23.89368 26.25163 46.77016 25
mean(findNrows.random(sym_matrix = sym_matrix, nrows = 10, ntrials=1000), na.rm = TRUE)
##[1] 0.6374652
Now, instead of simply throwing away the previous set of selected nrows rows whenever the next set has a bigger sum, we can try to improve the performance of this random search by using the newly selected rows to improve the previous set of selected rows. The heuristic we employ is the row sum of the resulting sub-matrices: at each trial we seek to replace rows in the current sub-matrix with rows from the newly selected sub-matrix that have larger rowSums (or, equivalently, larger rowMeans). This seems reasonable given that values are uniformly distributed in the full matrix, because a row of a selected sub-matrix with a higher rowMean will on average have higher-valued elements across the full row. Of course, after replacing rows (if any) in the current sub-matrix with rows from the newly selected sub-matrix to form the new sub-matrix, we still check whether this new sub-matrix is better (i.e., has a bigger sum) than the current sub-matrix before replacing the current best sub-matrix for the next trial. The code is as follows:
findNrows.faster <- function(sym_matrix, nrows, ntrials){
  set.seed(2017)
  s.rows <- sample.int(nrow(sym_matrix), nrows)
  s.means <- rowSums(sym_matrix[s.rows, s.rows], na.rm = TRUE)
  for (i in 1:ntrials) {
    t.rows <- sample.int(nrow(sym_matrix), nrows)
    t.means <- rowSums(sym_matrix[t.rows, t.rows], na.rm = TRUE)
    st.rows <- c(s.rows, t.rows)
    st.means <- c(s.means, t.means)
    ## need to make sure we do not have duplicates before we choose the best nrows
    dups <- duplicated(st.rows)
    st.rows <- st.rows[!dups]
    st.means <- st.means[!dups]
    new.rows <- st.rows[order(st.means, decreasing = TRUE)[1:nrows]]
    new.means <- rowSums(sym_matrix[new.rows, new.rows], na.rm = TRUE)
    if (sum(s.means) < sum(new.means)) {
      s.rows <- new.rows
      s.means <- new.means
    }
  }
  sym_matrix[s.rows, s.rows]
}
This algorithm is slower, but the result is significantly better than plain random search. Note that the comparison with findNrows.random is apples to apples, since the same number of trials and the same seed are used, so the same rows are randomly selected at each trial. However, we would expect the optimal algorithm to select a sub-matrix with a mean well over 0.9, so this algorithm is still far from optimal (a rough local-search sketch follows the benchmark below).
## Improved random search
microbenchmark(findNrows.faster(sym_matrix = sym_matrix, nrows = 10, ntrials=1000), times = 25L)
##Unit: milliseconds
## expr min lq mean median uq max neval
## findNrows.faster(sym_matrix = sym_matrix, nrows = 10, ntrials = 1000) 135.0531 136.3961 137.1123 136.7667 137.0439 143.0155 25
mean(findNrows.faster(sym_matrix = sym_matrix, nrows = 10, ntrials=1000), na.rm = TRUE)
##[1] 0.7797313
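As a rough indication of how much head-room remains, a greedy swap (hill-climbing) refinement can be layered on top of the same idea: repeatedly propose swapping one selected row for one unselected row, and keep the swap only when the sub-matrix sum improves. findNrows.swap below is only a sketch of that idea, not a tuned implementation:
findNrows.swap <- function(sym_matrix, nrows, ntrials){
  # start from a random selection, then greedily swap rows in/out
  set.seed(2017)
  s.rows <- sample.int(nrow(sym_matrix), nrows)
  s.sum <- sum(sym_matrix[s.rows, s.rows], na.rm = TRUE)
  for (i in 1:ntrials) {
    out.row <- sample(s.rows, 1)                                      # candidate row to drop
    in.row  <- sample(setdiff(seq_len(nrow(sym_matrix)), s.rows), 1)  # candidate row to add
    t.rows  <- c(setdiff(s.rows, out.row), in.row)
    t.sum   <- sum(sym_matrix[t.rows, t.rows], na.rm = TRUE)
    if (t.sum > s.sum) {  # accept the swap only if the sum improves
      s.rows <- t.rows
      s.sum  <- t.sum
    }
  }
  sym_matrix[s.rows, s.rows]
}
mean(findNrows.swap(sym_matrix, nrows = 10, ntrials = 1000), na.rm = TRUE)
Running it for more trials, or restarting it from several random initial selections, should push the mean higher still at proportional cost.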

Related

Is there a quick way in R to predict the class outcome of an observation from a nearest neighbours model from RANN?

I am trying to identify the most probable group that an observation belongs to, for several thousand large datasets. It is possible that some of the data is incorrectly classified, and I am trying to work out the most likely "true" value. I have tried to use knn3 from the caret package, but the predictions take too long to compute. In researching alternatives I came across the nn2 function from the RANN package, which performs a nearest-neighbour search that is significantly faster than the k-nearest-neighbours approach.
library(RANN)
library(tidyverse)
iris.scaled <- iris %>%
  mutate_if(is.numeric, scale)
iris.nn2 <- nn2(iris.scaled[1:4])
The result of the nn2 function is two lists, one of indices and one of distances. I want to use the indices table to work out the most likely grouping of each observation; however, it returns the row number of the observation and not its group. I need to replace this with the group it belongs to (in this case, the Species column).
distance.index <- iris.nn2$nn.idx[,-1]
target = iris.scaled$Species
I have removed the first column as the first nearest neighbour is always the observation itself.
matrix(target[distance.index[,]], nrow = nrow(distance.index), ncol = ncol(distance.index))
This code gives me the output I want, but is there a tidier way of creating this table and then calculating the most common response for each row? Speed of calculation is the key concern.
Your scaling can be a real bottleneck when you have more columns (tested on a 200 x 22216 gene expression matrix). My version might not seem that impressive on the iris dataset, but on the larger dataset I get 1.3 s vs. 32.8 s execution time.
Using tabulate instead of table gives an additional improvement, which is dwarfed, however, by the matrix scaling.
I used a custom scale function here, but using base::scale on a matrix would already be a major improvement.
I also addressed the issue raised by M. Papenberg of "self" not being considered the nearest neighbor by setting those to NA.
invisible(lapply(c("tidyverse", "matrixStats", "RANN", "microbenchmark", "compiler"),
                 require, character.only = TRUE))
enableJIT(3)
# faster column scaling (modified from https://www.r-bloggers.com/author/strictlystat/)
colScale <- function(x, center = TRUE, scale = TRUE, rows = NULL, cols = NULL) {
  if (!is.null(rows) && !is.null(cols)) {
    x <- x[rows, cols, drop = FALSE]
  } else if (!is.null(rows)) {
    x <- x[rows, , drop = FALSE]
  } else if (!is.null(cols)) {
    x <- x[, cols, drop = FALSE]
  }
  cm <- colMeans(x, na.rm = TRUE)
  if (scale) {
    csd <- matrixStats::colSds(x, center = cm, na.rm = TRUE)
  } else {
    csd <- rep(1, length = length(cm))
  }
  if (!center) cm <- rep(0, length = length(cm))
  x <- t((t(x) - cm) / csd)
  return(x)
}
# your posted version (mostly):
oldv <- function(){
  iris.scaled <- iris %>%
    mutate_if(is.numeric, scale)
  iris.nn2 <- nn2(iris.scaled[1:4])
  distance.index <- iris.nn2$nn.idx[, -1]
  target <- iris.scaled$Species
  category_neighbours <- matrix(target[distance.index[,]],
                                nrow = nrow(distance.index), ncol = ncol(distance.index))
  class <- apply(category_neighbours, 1, function(x) {
    x1 <- table(x)
    names(x1)[which.max(x1)]
  })
  cbind(iris, class)
}
## my version:
myv <- function(){
  iris.scaled <- colScale(data.matrix(iris[, 1:(dim(iris)[2] - 1)]))
  iris.nn2 <- nn2(iris.scaled)
  # set self neighbors to NA
  iris.nn2$nn.idx[iris.nn2$nn.idx - seq_len(dim(iris.nn2$nn.idx)[1]) == 0] <- NA
  # match up categories
  category_neighbours <- matrix(iris$Species[iris.nn2$nn.idx[,]],
                                nrow = dim(iris.nn2$nn.idx)[1], ncol = dim(iris.nn2$nn.idx)[2])
  # turn category_neighbours into numeric for tabulate
  cn <- matrix(as.numeric(factor(category_neighbours, exclude = NULL)),
               nrow = dim(iris.nn2$nn.idx)[1], ncol = dim(iris.nn2$nn.idx)[2])
  cnl <- levels(factor(category_neighbours, exclude = NULL))
  # tabulate frequencies and match up with factor levels
  class <- apply(cn, 1, function(x) {
    cnl[which.max(tabulate(x, nbins = length(cnl))[!is.na(cnl)])]
  })
  cbind(iris, class)
}
microbenchmark(oldv(), myv(), times=100L)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> oldv() 11.015986 11.679337 12.806252 12.064935 12.745082 33.89201 100 b
#> myv() 2.430544 2.551342 3.020262 2.612714 2.691179 22.41435 100 a

Is it possible to do your own efficient descriptive statistics function? - R

Usually, I find myself using a few summary functions or making my own computations to get some additional initial information from the data. For example, I wanted to see the count and percentage per variable given a limit of distinct values:
table_transposed <- function(vector){
  merge(as.data.frame(table(vector, dnn = "values")),
        as.data.frame(round(prop.table(table(vector, dnn = "values")), 2)),
        by = "values",
        all.x = TRUE) %>%
    data.table::transpose(keep.names = "values",
                          make.names = names(.)[1]) %T>%
    {.[, c("values")] <- c("Count", "Percentage")}
}
table_transposed_filter <- function(dataframe, max_number_categories) {
  (lapply(dataframe, function(x) NROW(unique(x))) <= max_number_categories) %>%
    as.vector() %>%
    {dataframe[, .]} %>%
    lapply(table_transposed)
}
So, you give the dataframe and the threshold of distinct values per variable.
table_transposed_filter(mtcars, 10)
However, it's SUPER slow (maybe because of using merge() instead of left_join() from dplyr). Now, I'm trying to figure out an efficient, fast, and simple way to build a combination of psych::describe(), Hmisc::describe(), others, and my own, for numeric and categorical variables (one descriptive function for each). Something like (for numerical):
| Variable | dtype | mean | mode | variance | skew | percentile 25 | ...
If I create this table mainly with sapply(), for example, is that better (more efficient, faster, simpler code) than actually learning to create an R package and developing it in there?
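For concreteness, a rough sapply()-based sketch of the kind of numeric summary I have in mind (only a few of the columns, since skew and mode would need helper functions; the num_summary name is just a placeholder):
num_summary <- function(df) {
  num <- df[sapply(df, is.numeric)]   # keep numeric columns only
  stats <- sapply(num, function(x) c(
    mean     = mean(x, na.rm = TRUE),
    variance = var(x, na.rm = TRUE),
    p25      = unname(quantile(x, 0.25, na.rm = TRUE)),
    median   = median(x, na.rm = TRUE),
    p75      = unname(quantile(x, 0.75, na.rm = TRUE))
  ))
  # one row per variable, one column per statistic
  data.frame(Variable = colnames(stats), t(stats), row.names = NULL)
}
num_summary(mtcars)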
PS: I thought to put this question in StackMetaExchange or Crossvalidation, but none of them seem to fit it.
Here's a somewhat faster version. It's about 2x faster on small data (like mtcars), but the difference narrows a little bit on larger data.
This makes sense, as the most expensive operation you do is table - your version does it twice, my version does it once. I didn't profile the code, but my guess is that table is the bottleneck by more than an order of magnitude on any sizeable data (a rough Rprof sketch after the functions below shows one way to check this), so it's a waste to try to optimize any other parts of the code.
t_transp = function(x, digits = 2) {
  tab = table(x)
  prop_tab = prop.table(tab)
  df = data.frame(values = c("Count", "Percentage"))
  df = cbind(df, rbind(tab, round(prop_tab, digits = digits)))
  row.names(df) = NULL
  df
}
t_transp_filter = function(data, n_max, ...) {
  lapply(Filter(function(x) NROW(unique(x)) <= n_max, data), t_transp, ...)
}
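If you want to verify that table() dominates the runtime, a quick pass with base R's Rprof along these lines would show it (a sketch only; it assumes your functions and their packages are loaded, and the data frame mirrors the larger benchmark data below):
# profile the OP's version on a moderately sized data frame
set.seed(47)
big <- as.data.frame(matrix(sample(letters[1:5], 1e5 * 20, replace = TRUE), ncol = 20))
Rprof(prof_file <- tempfile())
invisible(table_transposed_filter(big, 5))
Rprof(NULL)
head(summaryRprof(prof_file)$by.total)  # table and its callees should top the list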
Benchmarking:
microbenchmark::microbenchmark(
  gregor = t_transp_filter(mtcars, n_max = 4),
  OP = table_transposed_filter(mtcars, 4),
  times = 20
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# gregor 1.6483 1.7093 2.253425 1.74765 1.84680 7.5394 20 a
# OP 5.6988 5.7627 6.316295 6.08545 6.57965 8.1048 20 b
set.seed(47)
df = as.data.frame(matrix(
  c(sample(letters[1:5], size = 1e5 * 20, replace = T)),
  ncol = 20))
microbenchmark::microbenchmark(
  gregor = t_transp_filter(df, n_max = 5),
  OP = table_transposed_filter(df, 5),
  times = 20
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# gregor 59.5466 59.95545 63.6825 61.14075 67.2167 75.4270 20 a
# OP 110.3265 117.35585 123.8782 118.91005 133.7795 149.0651 20 b

Function that computes the average of 20 rolls several thousand times and estimates the expectation and variance of Y

I need to write a function according to the info in the title. I'm trying to perform that with the following code:
my.function <- function(x = 1:6, Nsample = 20, prob1 = NULL) {
  rolling.die <- sample(x, size = Nsample, replace = TRUE, prob = prob1)
  for (die in 1:10000) {
    die.sum <- sum(rolling.die)
    average <- die.sum / Nsample
  }
  return(var(average))
}
my.function()
But I always get NA as a result. Could you please help me understand what I am doing wrong?
You need replicate() -
set.seed(2)
test <- replicate(1000, mean(sample(1:6, 20, replace = T)))
# for expectation
mean(test)
[1] 3.50025
# for variance
var(test)
[1] 0.147535
average is a single number, and it does not make sense to calculate the variance of a single number. What is the variance of 5? Variance applies to a collection of numbers, so your average must be a vector.
A more efficient approach is to generate all your data ahead of time. As long as you have the memory, this would be a very fast approach:
# sim parameters
n_rolls <- 20L #L means integer variables
n_sim <- 10000L
n_sides <- 6L
#generate data
set.seed(2)
sims <- sample(n_sides, n_rolls * n_sim, replace = T)
#make into matrix of n_sims x n_rolls
mat <- matrix(sims, ncol = n_rolls)
#mean of each simulation
rowMeans(mat)
#var of everything
var(rowMeans(mat))
This is around 14x faster than using replicate as this approach calls sample() once.
Unit: milliseconds
expr min lq mean median uq max neval
shree_replic 137.7283 138.9809 145.78485 142.34755 147.2499 172.4633 10
cole_samp_mat 11.3998 11.4477 11.57025 11.52105 11.7628 11.8218 10
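The two expressions timed above are, roughly, the replicate() version and the matrix version wrapped for microbenchmark; the exact wrappers were not shown, but they could look something like this (the bodies are an assumption, using the same simulation parameters as above):
library(microbenchmark)
n_rolls <- 20L; n_sim <- 10000L; n_sides <- 6L
microbenchmark(
  # one call to sample() per simulation
  shree_replic  = replicate(n_sim, mean(sample(n_sides, n_rolls, replace = TRUE))),
  # a single call to sample(), then row means of the matrix
  cole_samp_mat = rowMeans(matrix(sample(n_sides, n_rolls * n_sim, replace = TRUE),
                                  ncol = n_rolls)),
  times = 10
)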
As for your current function, it doesn't make sense: the loop doesn't do anything, it just repeats the same calculation 10,000 times, and, as #user31264 points out, it then tries to calculate the var of a scalar after the loop. I think you mean to do something like:
my.function2 <- function(x = 1:6, Nsample = 20, prob1 = NULL) {
  rolling.die <- sample(x, size = Nsample, replace = TRUE, prob = prob1)
  return(mean(rolling.die))
}
means <- vector(mode = 'double', length = n_sim)
for (i in 1:n_sim){
  means[i] <- my.function2()
}
#which is equivalent to
means <- sapply(1:n_sim, my.function2)
#which is also equivalent to
means <- replicate(n_sim, my.function2())
var(means)
And #shree has a much more succinct version of your function.

Optimising sapply() or for(), paste(), to efficiently transform sparse triplet matrix to a libsvm format

I have a piece of R code I want to optimise for speed when working with larger datasets. It currently depends on sapply cycling through a vector of numbers (which correspond to rows of a sparse matrix). The reproducible example below gets at the nub of the problem; it is the three-line function expensive() that chews up the time, and it's obvious why (lots of matching of big vectors against each other, and two nested paste() calls for each cycle of the loop). Before I give up and start struggling with doing this bit of the work in C++, is there something I'm missing? Is there a way to vectorize the sapply call that will make it an order of magnitude or three faster?
library(microbenchmark)
# create an example object like a simple_triple_matrix
# number of rows and columns in sparse matrix:
n <- 2000 # real number is about 300,000
ncols <- 1000 # real number is about 80,000
# number of non-zero values, about 10 per row:
nonzerovalues <- n * 10
stm <- data.frame(
  i = sample(1:n, nonzerovalues, replace = TRUE),
  j = sample(1:ncols, nonzerovalues, replace = TRUE),
  v = sample(rpois(nonzerovalues, 5), replace = TRUE)
)
# It seems to save about 3% of time to have i, j and v as objects in their own right
i <- stm$i
j <- stm$j
v <- stm$v
expensive <- function(){
  sapply(1:n, function(k){
    # microbenchmarking suggests quicker to have which() rather than a vector of TRUE and FALSE:
    whichi <- which(i == k)
    paste(paste(j[whichi], v[whichi], sep = ":"), collapse = " ")
  })
}
microbenchmark(expensive())
The output of expensive is a character vector, of n elements, that looks like this:
[1] "344:5 309:3 880:7 539:6 338:1 898:5 40:1"
[2] "307:3 945:2 949:1 130:4 779:5 173:4 974:7 566:8 337:5 630:6 567:5 750:5 426:5 672:3 248:6 300:7"
[3] "407:5 649:8 507:5 629:5 37:3 601:5 992:3 377:8"
For what it's worth, the motivation is to efficiently write data from a sparse matrix format - either from slam or Matrix, but starting with slam - into libsvm format (which is the format above, except that each row begins with a number representing a target variable for a support vector machine; that part is omitted in this example as it's not part of the speed problem). I'm trying to improve on the answers to this question. I forked one of the repositories referred to from there and adapted its approach to work with sparse matrices with these functions. The tests show that it works fine, but it doesn't scale up.
Use package data.table. Its by, combined with fast sorting, saves you from repeatedly finding the indices of equal i values.
res1 <- expensive()
library(data.table)
cheaper <- function() {
  setDT(stm)
  res <- stm[, .(i, jv = paste(j, v, sep = ":"))
             ][, .(res = paste(jv, collapse = " ")), keyby = i][["res"]]
  setDF(stm) # clean-up, which might not be necessary
  res
}
res2 <- cheaper()
all.equal(res1, res2)
#[1] TRUE
microbenchmark(expensive(),
cheaper())
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# expensive() 127.63343 135.33921 152.98288 136.13957 138.87969 222.36417 100 b
# cheaper() 15.31835 15.66584 16.16267 15.98363 16.33637 18.35359 100 a
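To connect this back to the libsvm motivation: once cheaper() has produced the "j:v" strings, prepending the target column is just one more paste(). The label vector y here is hypothetical, purely for illustration:
# y is a hypothetical vector of target labels, one per output row
y <- sample(c(-1, 1), length(res2), replace = TRUE)
libsvm_lines <- paste(y, res2)            # "label j1:v1 j2:v2 ..."
# writeLines(libsvm_lines, "out.libsvm")  # one row of the sparse matrix per line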

R: Creating a vector with a specific amount of random numbers

I was hoping someone could help point me in the right direction to create a vector in R containing a defined number of randomly generated values. I am a complete newbie to R, and I have learned that the concatenate function c() is used for creating vectors. However, I wish to populate the vector with 50 random numbers. I do not wish to specify a range or any other conditions for the numbers.
MyVectorObject <- c(...)
Any suggestions would be greatly appreciated!
It depends on which numbers you want to generate. Here are some options:
x1 <- rpois(n = 50, lambda = 10)
x2 <- runif(n = 50, min = 1, max = 10)
x3 <- sample(x = c(1, 3, 5), size = 50, replace = TRUE)
If we are talking about integers, you want to generate numbers in the interval [-base::.Machine$integer.max, base::.Machine$integer.max], which on my computer is the interval [-2147483647, 2147483647].
Implementation
You can use base::sample to generate positive numbers from 1 to base::.Machine$integer.max:
random.pos <- function(N) {
  int.max <- .Machine$integer.max
  return(sample(int.max, N, replace = TRUE))
}
If you also want negative numbers, you can use:
random.posneg <- function(N) {
  int.max <- .Machine$integer.max
  random.numbers <- sample(int.max, N, replace = TRUE)
  random.signs <- sample(c(1, -1), N, replace = TRUE)
  return(random.numbers * random.signs)
}
NOTE: neither of the functions above ever generates 0 (zero).
The best approach (in my opinion) is to use the base::runif function.
random.runif <- function(N) {
  int.max <- .Machine$integer.max
  int.min <- -int.max
  return(as.integer(runif(N, int.min, int.max)))
}
This will be a little bit slower than using base::sample, but you get uniformly distributed random numbers that can include zero.
Benchmark
library(microbenchmark)
require(compiler)
random.runif <- cmpfun(random.runif)
random.pos <- cmpfun(random.pos)
random.posneg <- cmpfun(random.posneg)
N <- 500
op <- microbenchmark(
  RUNIF = random.runif(N),
  POS = random.pos(N),
  POSNEG = random.posneg(N),
  times = 10000
)
print(op)
## library(ggplot2)
## boxplot(op)
## qplot(y=time, data=op, colour=expr) + scale_y_log10()
And the results from the benchmark above:
Unit: microseconds
expr min lq mean median uq max neval
RUNIF 13.423 14.251 15.197122 14.482 14.694 2425.290 10000
POS 4.174 5.043 5.613292 5.317 5.645 2436.909 10000
POSNEG 11.673 12.845 13.383194 13.285 13.800 60.304 10000
