I've seen a few solutions to similar problems, but they all require iteration over the number of items to be added together.
Here's my goal: from a list of numbers, find all of the combinations (without replacement) that add up to a certain total. For example, if I have the numbers 1,1,2,3,5 and total 5, it should return 5; 2,3; and 1,1,3.
I was trying to use combn, but it required you to specify the number of items in each combination. Is there a way to do it that allows for solution sets of any size?
This is precisely what comboGeneral/permuteGeneral from RcppAlgos (I am the author) were built for. Since we have repetition of specific elements in our sample vector, we will be finding combinations of multisets that meet our criteria. Note that this is different from the more common case of generating combinations with repetition, where each element is allowed to be repeated m times. For many combination-generating functions, multisets pose problems as duplicates are introduced and must be dealt with. This can become a bottleneck in your code if the size of your data is decently large. The functions in RcppAlgos handle these cases efficiently without creating any duplicate results. I should mention that there are a couple of other great libraries that handle multisets quite well: multicool and arrangements.
Moving on to the task at hand, we can utilize the constraint arguments of comboGeneral to find all combinations of our vector that meet a specific criterion:
library(RcppAlgos)

vec <- c(1,1,2,3,5)  ## using variables from @r2evans
uni <- unique(vec)
myRep <- rle(vec)$lengths
ans <- 5

## Widths 1..length(uni) suffice here; use seq_len(sum(myRep)) if
## solutions longer than the number of distinct values are possible.
lapply(seq_along(uni), function(x) {
    comboGeneral(uni, x, freqs = myRep,
                 constraintFun = "sum",
                 comparisonFun = "==",
                 limitConstraints = ans)
})
[[1]]
     [,1]
[1,]    5

[[2]]
     [,1] [,2]
[1,]    2    3

[[3]]
     [,1] [,2] [,3]
[1,]    1    1    3

[[4]]
     [,1] [,2] [,3] [,4]   ## no solutions of length 4
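If the empty matrix at the end is unwanted, a small optional tidy-up (a sketch reusing the variables above; Filter treats a zero row count as FALSE):

sols <- Filter(nrow, lapply(seq_along(uni), function(x)
    comboGeneral(uni, x, freqs = myRep,
                 constraintFun = "sum",
                 comparisonFun = "==",
                 limitConstraints = ans)))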
These functions are highly optimized and extend well to larger cases. For example, consider the following case, which checks over 32 million combinations:
## N.B. Using R 4.0.0 with new updated RNG introduced in 3.6.0
set.seed(42)
bigVec <- sort(sample(1:30, 40, TRUE))
rle(bigVec)
Run Length Encoding
lengths: int [1:22] 2 1 2 3 4 1 1 1 2 1 ...
values : int [1:22] 1 2 3 4 5 7 8 9 10 11 ...
bigUni <- unique(bigVec)
bigRep <- rle(bigVec)$lengths
bigAns <- 199
len <- 12
comboCount(bigUni, len, freqs = bigRep)
[1] 32248100
All 280,000+ results are returned very quickly:
system.time(bigTest <- comboGeneral(bigUni, len, freqs = bigRep,
                                    constraintFun = "sum",
                                    comparisonFun = "==",
                                    limitConstraints = bigAns))
user system elapsed
0.273 0.004 0.271
head(bigTest)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 1 1 2 3 4 25 26 26 26 27 28 30
[2,] 1 1 2 3 5 24 26 26 26 27 28 30
[3,] 1 1 2 3 5 25 25 26 26 27 28 30
[4,] 1 1 2 3 7 24 24 26 26 27 28 30
[5,] 1 1 2 3 7 24 25 25 26 27 28 30
[6,] 1 1 2 3 7 24 25 26 26 26 28 30
nrow(bigTest)
[1] 280018
all(rowSums(bigTest) == bigAns)
[1] TRUE
Addendum
I must mention that generally, when I see a problem like "find all combinations that sum to a particular number," my first thought is integer partitions. For example, in the related problem Getting all combinations which sum up to 100 in R, we can easily solve it with the partitions library. However, this approach does not extend to the general case (as we have here) where the vector contains specific repetition, or where the values don't easily convert to integer equivalents. E.g., the vector c(0.1, 0.2, 0.3, 0.4) can easily be treated as 1:4, but treating c(3.98486, 7.84692, 0.0038937, 7.4879) as integers and subsequently applying an integer-partitions approach would require an extravagant amount of computing power, rendering this method useless.
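For reference, here is a minimal sketch of the integer-partitions route on the simple case. It assumes the partitions package is installed; parts(n) returns every partition of n, one per column, zero-padded:

library(partitions)

## All integer partitions of 5, one per column
parts(5)

## Restricting these to a specific multiset such as c(1,1,2,3,5) would
## still require filtering the columns by hand, which is exactly where
## this approach stops being general.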
I took your combn idea and looped over the possible sizes of the sets.
func = function(x, total){
    M = length(x)
    y = NULL
    for (m in 1:M){
        tmp = combn(x, m)
        ind = which(colSums(tmp) == total)
        if (length(ind) > 0){
            for (j in 1:length(ind))
                y = c(y, list(tmp[, ind[j]]))
        }
    }
    return(unique(lapply(y, sort)))
}
x = c(1,1,2,3,5,8,13)
> func(x, 15)
[[1]]
[1] 2 13
[[2]]
[1] 1 1 13
[[3]]
[1] 2 5 8
[[4]]
[1] 1 1 5 8
[[5]]
[1] 1 1 2 3 8
Obviously, this will have problems as M grows, since combn(x, m) produces choose(M, m) columns, so tmp gets big pretty quickly, and the length of y can't easily be pre-determined.
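To put a number on that growth, a quick back-of-the-envelope check in plain base R:

## Candidate columns across all sizes total sum(choose(M, m)) = 2^M - 1,
## so the work doubles with every extra element.
M <- 25
sum(choose(M, 1:M))
# [1] 33554431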
Similar to mickey's answer, we can use combn inside another looping mechanism. I'll use lapply:
vec <- c(1,1,2,3,5)
ans <- 5
Filter(length, lapply(seq_len(length(vec)), function(i) {
    v <- combn(vec, i)
    v[, colSums(v) == ans, drop = FALSE]
}))
# [[1]]
# [,1]
# [1,] 5
# [[2]]
# [,1]
# [1,] 2
# [2,] 3
# [[3]]
# [,1]
# [1,] 1
# [2,] 1
# [3,] 3
You can omit the Filter(length, ...) portion, though without it the result may include a number of empty matrices. They're easy enough to deal with and ignore; I just thought removing them would be aesthetically preferred.
This method gives you matrices with one candidate per column, so a single list element can contain multiple solutions:
ans <- 4
Filter(length, lapply(seq_len(length(vec)), function(i) {
    v <- combn(vec, i)
    v[, colSums(v) == ans, drop = FALSE]
}))
# [[1]]
# [,1] [,2]
# [1,] 1 1
# [2,] 3 3
# [[2]]
# [,1]
# [1,] 1
# [2,] 1
# [3,] 2
If duplicates are a problem, you can always do:
Filter(length, lapply(seq_len(length(vec)), function(i) {
    v <- combn(vec, i)
    v <- v[, colSums(v) == ans, drop = FALSE]
    v[, !duplicated(t(v)), drop = FALSE]
}))
# [[1]]
# [,1]
# [1,] 1
# [2,] 3
# [[2]]
# [,1]
# [1,] 1
# [2,] 1
# [3,] 2
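If you prefer the question's list-of-vectors output over column matrices, one possible post-processing step (a sketch; split(m, col(m)) breaks a matrix into a list of its columns):

res <- Filter(length, lapply(seq_len(length(vec)), function(i) {
    v <- combn(vec, i)
    v <- v[, colSums(v) == ans, drop = FALSE]
    v[, !duplicated(t(v)), drop = FALSE]
}))

## One numeric vector per solution
unlist(lapply(res, function(m) split(m, col(m))), recursive = FALSE)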
Now here is a solution involving gtools:
x <- c(1,1,2,3,5)
magicNumber <- 5

# Creating a list of all permutations of the vector x
df1 <- gtools::permutations(n = length(x), r = length(x),
                            v = 1:length(x), repeats.allowed = FALSE)
ls1 <- list()
for (j in 1:nrow(df1)) ls1[[j]] <- x[df1[j, ]]

# Taking all cumulative sums and filtering entries equaling our magic number
sumsCum <- t(vapply(1:length(ls1), function(j) cumsum(ls1[[j]]), numeric(length(x))))
indexMN <- which(sumsCum == magicNumber, arr.ind = TRUE)

finalList <- list()
for (j in 1:nrow(indexMN)) {
    magicRow <- indexMN[j, 1]
    magicCol <- 1:indexMN[j, 2]
    finalList[[j]] <- ls1[[magicRow]][magicCol]
}
finalList <- unique(lapply(finalList, sort))
This is a first draft; I am sure it can be improved here and there.
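One caveat worth keeping in mind (my note, not the author's): this enumerates every ordering of x, so the work grows factorially with the input size:

factorial(length(x))   # 120 permutation rows for the 5-element example
factorial(10)          # 3628800 -- already unwieldy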
Not the most efficient, but the most compact so far. Padding x with n-1 zeros lets a single combn() of size n reach every subset of x of sizes 1 through n; the zeros are then stripped off, subsets with the wrong sum collapse to empty vectors, and unique(...)[-1] drops that shared empty entry (so this assumes x itself contains no zeros):
x <- c(1,1,2,3,5)
n <- length(x)
res <- 5
unique(combn(c(x,rep(0,n-1)), n, function(x) x[x!=0][sum(x)==res], FALSE))[-1]
# [[1]]
# [1] 1 1 3
#
# [[2]]
# [1] 2 3
#
# [[3]]
# [1] 5
#
I want to split a large matrix, mt, into a list of sub-matrices, res. The number of rows for each sub-matrix is specified by a vector, len.
For example,
> mt=matrix(c(1:20),ncol=2)
> mt
[,1] [,2]
[1,] 1 11
[2,] 2 12
[3,] 3 13
[4,] 4 14
[5,] 5 15
[6,] 6 16
[7,] 7 17
[8,] 8 18
[9,] 9 19
[10,] 10 20
> lens=c(2,3,5)
What I want is a function, some_function, that offers the following result:
> res=some_function(mt,lens)
> res
[[1]]
[,1] [,2]
[1,] 1 11
[2,] 2 12
[[2]]
[,1] [,2]
[1,] 3 13
[2,] 4 14
[3,] 5 15
[[3]]
[,1] [,2]
[1,] 6 16
[2,] 7 17
[3,] 8 18
[4,] 9 19
[5,] 10 20
Speed is a big concern. The faster, the better!
Many thanks!
A function that builds an index from the number of rows in each group and then splits the matrix:
mt <- matrix(c(1:20), ncol=2)
# Two arguments: m - matrix, len - number of rows in each group
m_split <- function(m, len){
    index <- 1:sum(len)
    group <- rep(1:length(len), times = len)
    index_list <- split(index, group)
    mt_list <- lapply(index_list, function(vec) m[vec, ])
    return(mt_list)
}
m_split(mt, c(2, 3, 5))
$`1`
[,1] [,2]
[1,] 1 11
[2,] 2 12
$`2`
[,1] [,2]
[1,] 3 13
[2,] 4 14
[3,] 5 15
$`3`
[,1] [,2]
[1,] 6 16
[2,] 7 17
[3,] 8 18
[4,] 9 19
[5,] 10 20
Update
I used the following code to compare the performance of each method in this post.
library(microbenchmark)
library(data.table)
# Test case from @missuse
mt <- matrix(c(1:20000000),ncol=10)
lens <- c(20000,15000,(nrow(mt)-20000-15000))
# Functions from @Damiano Fantini
split.df <- function(mt, lens) {
    fac <- do.call(c, lapply(1:length(lens), (function(i){ rep(i, lens[i])})))
    split(as.data.frame(mt), f = fac)
}
split.mat <- function(mt, lens) {
    fac <- do.call(c, lapply(1:length(lens), (function(i){ rep(i, lens[i])})))
    lapply(unique(fac), (function(i) {mt[fac==i,]}))
}
# Benchmarking
microbenchmark(m1 = {m_split(mt, lens)},                    # @ycw's method
               m2 = {pam = rep(1:length(lens), times = lens)
                     split(data.table(mt), pam)},           # @missuse's data.table method
               m3 = {split.df(mt, lens)},                   # @Damiano Fantini's data frame method
               m4 = {split.mat(mt, lens)})                  # @Damiano Fantini's matrix method
Unit: milliseconds
expr min lq mean median uq max neval
m1 167.6896 209.7746 251.0932 230.5920 274.9347 555.8839 100
m2 402.3415 497.2397 554.1094 547.9603 599.7632 787.4112 100
m3 552.8548 657.6245 719.2548 711.4123 769.6098 989.6779 100
m4 166.6581 203.6799 249.2965 235.5856 275.4790 547.4927 100
As we can see, m1 and m4 are the fastest, with almost no difference between them. That means there is no need to convert the matrix to a data frame or a data.table, especially if the OP will keep working on the matrix; working directly on it (m1 and m4) is sufficient.
If you are OK working with data.frames instead of matrices, you might build a grouping factor/vector according to lens and then use split(). Alternatively, use this grouping vector to subset your matrix and return a list. In this example, I wrapped the two solutions into two functions:
# your data
mt=matrix(c(1:20),ncol=2)
lens=c(2,3,5)
# based on split
split.df <- function(mt, lens) {
    fac <- do.call(c, lapply(1:length(lens), (function(i){ rep(i, lens[i])})))
    split(as.data.frame(mt), f = fac)
}
split.df(mt, lens)
# based on subsetting
split.mat <- function(mt, lens) {
    fac <- do.call(c, lapply(1:length(lens), (function(i){ rep(i, lens[i])})))
    lapply(unique(fac), (function(i) {mt[fac==i,]}))
}
split.mat(mt, lens)
This second option is about 10 times faster than the other one, according to microbenchmark:
library(microbenchmark)
microbenchmark({split.df(mt, lens)}, times = 1000)
# median = 323.743 microseconds
microbenchmark({split.mat(mt, lens)}, times = 1000)
# median = 31.7645 microseconds
One approach is using split. However, split operates on vectors and data.frames, so you need to convert the matrix first; data.table should be efficient:
mt=matrix(c(1:20),ncol=2)
lens=c(2,3,5)
pam = rep(1:length(lens), times = lens)
library(data.table)
mt_split <- split(data.table(mt), pam)
mt_split
#output
$`1`
V1 V2
1: 1 11
2: 2 12
$`2`
V1 V2
1: 3 13
2: 4 14
3: 5 15
$`3`
V1 V2
1: 6 16
2: 7 17
3: 8 18
4: 9 19
5: 10 20
Checking speed
mt=matrix(c(1:20000000),ncol=10)
lens=c(20000,15000,(nrow(mt)-20000-15000))
pam = rep(1:length(lens), times = lens)
system.time(split(data.table(mt), pam))
#output
user system elapsed
0.75 0.20 0.96
So I have taken a look at a question posted before, which was about summing every 2 values in each row of a matrix: sum specific columns among rows. I also took a look at another question here: R Sum every k columns in matrix, which is more similar to mine, but I could not get its solution to work in this case. Here is the code that I am working with...
y <- matrix(1:27, nrow = 3)
y
m1 <- as.matrix(y)
n <- 3
dim(m1) <- c(nrow(m1)/n, ncol(m1), n)
res <- matrix(rowSums(apply(m1, 1, I)), ncol=n)
identical(res[1,],rowSums(y[1:3,]))
sapply(split.default(y, 0:(length(y)-1) %/% 3), rowSums)
I just get an error message when applying this. The desired output is a matrix with the following values:
[,1] [,2] [,3]
[1,] 12 39 66
[2,] 15 42 69
[3,] 18 45 72
To sum consecutive sets of n elements from each row, you just need to write a function that does the summing and apply it to each row:
n <- 3
t(apply(y, 1, function(x) tapply(x, ceiling(seq_along(x)/n), sum)))
# 1 2 3
# [1,] 12 39 66
# [2,] 15 42 69
# [3,] 18 45 72
Transform the matrix to an array and use colSums (as suggested by @nongkrong):
y <- matrix(1:27, nrow = 3)
n <- 3
a <- y
dim(a) <- c(nrow(a), ncol(a)/n, n)
b <- aperm(a, c(2,1,3))
colSums(b)
# [,1] [,2] [,3]
#[1,] 12 39 66
#[2,] 15 42 69
#[3,] 18 45 72
Of course this assumes that ncol(y) is divisible by n.
PS: You can of course avoid creating so many intermediate objects. They are there for didactic purposes.
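For the record, the same computation collapses into a single expression (a sketch, identical in behavior to the steps above):

## Reshape to a 3-D array, move the within-group column index first,
## then sum it out
colSums(aperm(array(y, c(nrow(y), ncol(y)/n, n)), c(2, 1, 3)))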
I would do something similar to the OP -- apply rowSums on subsets of the matrix:
n = 3
ng = ncol(y)/n
sapply( 1:ng, function(jg) rowSums(y[, (jg-1)*n + 1:n ]))
# [,1] [,2] [,3]
# [1,] 12 39 66
# [2,] 15 42 69
# [3,] 18 45 72
I have an N-by-M matrix X, and I need to calculate an N-by-N matrix Y:
Y[i, j] = sum((X[i,] - X[j,])^2),  1 <= i, j <= N
For now, I have to use nested loops, which is O(N^2) in vector operations. I would like to know if there's a better way, like using matrix operations.
More generally, sum(...) could be any function fun(x1, x2), where x1 and x2 are M-by-1 vectors.
You can use expand.grid to get a data.frame of all pairs of row indices:
X <- matrix(sample(1:5, 50, replace = TRUE), nrow = 10)
row.ind <- expand.grid(1:nrow(X), 1:nrow(X))
Then apply along each pair using a function:
myfun <- function(n) {
    sum((X[row.ind[n, 1], ] - X[row.ind[n, 2], ])^2)
}
Y <- matrix(unlist(lapply(1:nrow(row.ind), myfun)), byrow = TRUE, nrow = nrow(X))
Y is then a symmetric 10-by-10 matrix with zeros on the diagonal; the exact entries depend on the random draw above.
I bet there is a better way, but it's Friday and I'm tired!
(x[i] - x[j])^2 = x[i]^2 - 2*x[i]*x[j] + x[j]^2
The same expansion holds row-wise: the middle term is the matrix product -2 * X %*% t(X), and the two squared-norm terms are vectors that get added across the rows and columns.
This runs at matrix-multiplication complexity, roughly O(n^2.8) in practice (or whatever your BLAS achieves).
In R:
vec <- rowSums(X^2)              # squared norm of each row
Y <- -2 * X %*% t(X)             # cross terms
Y <- Y + outer(vec, vec, "+")    # add vec[i] + vec[j] to every entry
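A quick sanity check of that identity against the naive double loop (my addition; small random X):

X <- matrix(rnorm(20), nrow = 5)
vec <- rowSums(X^2)
Y <- outer(vec, vec, "+") - 2 * X %*% t(X)

## Naive O(N^2) reference
Yref <- outer(1:5, 1:5, Vectorize(function(i, j) sum((X[i, ] - X[j, ])^2)))
all.equal(Y, Yref)   # TRUE, up to floating-point error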
In MATLAB, for your specific f, you could just do this:
Y = squareform(pdist(X)).^2;
For a non-"cheating" version, try something like this (MATLAB):
[N, M] = size(X);
f = @(u, v) sum((u - v).^2);
helpf = @(i, j) f(X(i, :), X(j, :));
Y = arrayfun(helpf, meshgrid(1:N, 1:N), meshgrid(1:N, 1:N)');
There are more efficient ways of doing it for the specific function sum(...), but your question asked for a general way for a general function f. In general, this operation will cost O(N^2) times the complexity of each vector-pair evaluation, because that's how many pairs there are. If f has a special form, some intermediate results can be reused.
I have the following data frame and vector.
> y
v1 v2 v3
1 1 6 43
2 4 7 5
3 0 2 32
> v
[1] 1 2 3
I want to apply the following function to every ROW in that data frame such that v is added to every ROW of y:
x <- function(vector1, vector2) {
    vector1 + vector2
}
... in order to get THESE results:
v1 v2 v3
1 2 8 46
2 5 9 8
3 1 4 35
mapply applies the function to COLUMNS:
> z <- mapply(x, y, MoreArgs=list(vector2=v))
> z
v1 v2 v3
[1,] 2 7 44
[2,] 6 9 7
[3,] 3 5 35
I've tried transposing the data frame so that the function will be applied to rows and not columns, but mapply gives me weird results after transposing:
> transposed <- t(y)
> transposed
[,1] [,2] [,3]
v1 1 4 0
v2 6 7 2
v3 43 5 32
> z <- mapply(x, transposed, MoreArgs=list(vector2=v))
> z
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 2 7 44 5 8 6 1 3 33
[2,] 3 8 45 6 9 7 2 4 34
[3,] 4 9 46 7 10 8 3 5 35
...Help?
############################ EDIT #########################
Thanks for all the answers! I'm learning tons of new R functions that I've never seen before, which is fantastic.
I want to clarify my earlier question a bit. What I'm really asking is a much more general question: how do you apply a multi-parameter function to each row in R? (At the moment, I'm tempted to conclude that I should just use a loop, but I would like to figure out whether it IS possible, just for future reference.) I also purposefully refrained from showing the code I'm working with, since it's kind of messy.
I tried using the sweep function as was suggested, but I get the following error:
testsweep <- function(vector, z, n) {
    z
}
> n <- names(Na_exp)
> n
[1] "NaCl.10000.2hr.AVG_Signal" "NaCl.10000.4hr.AVG_Signal"
> t <- head(Li_fcs,n=1)
> t
LiCl.1000.1hr.FoldChange LiCl.2000.1hr.FoldChange LiCl.5000.1hr.FoldChange
[1,] -0.05371838 -0.1010928 -0.01939986
LiCl.10000.1hr.FoldChange LiCl.1000.2hr.FoldChange
[1,] 0.1275617 -0.107154
LiCl.2000.2hr.FoldChange LiCl.5000.2hr.FoldChange
[1,] -0.06760782 -0.09770226
LiCl.10000.2hr.FoldChange LiCl.1000.4hr.FoldChange
[1,] -0.1124188 -0.06140386
LiCl.2000.4hr.FoldChange LiCl.5000.4hr.FoldChange
[1,] -0.04323497 -0.04275953
LiCl.10000.4hr.FoldChange LiCl.1000.8hr.FoldChange
[1,] 0.03633496 0.01879461
LiCl.2000.8hr.FoldChange LiCl.5000.8hr.FoldChange
[1,] 0.257977 -0.06357423
LiCl.10000.8hr.FoldChange
[1,] 0.07214176
> z <- colnames(Li_fcs)
> z
[1] "LiCl.1000.1hr.FoldChange" "LiCl.2000.1hr.FoldChange"
[3] "LiCl.5000.1hr.FoldChange" "LiCl.10000.1hr.FoldChange"
[5] "LiCl.1000.2hr.FoldChange" "LiCl.2000.2hr.FoldChange"
[7] "LiCl.5000.2hr.FoldChange" "LiCl.10000.2hr.FoldChange"
[9] "LiCl.1000.4hr.FoldChange" "LiCl.2000.4hr.FoldChange"
[11] "LiCl.5000.4hr.FoldChange" "LiCl.10000.4hr.FoldChange"
[13] "LiCl.1000.8hr.FoldChange" "LiCl.2000.8hr.FoldChange"
[15] "LiCl.5000.8hr.FoldChange" "LiCl.10000.8hr.FoldChange"
But when I try to apply sweep...
> test <- sweep(t, 2, z, n, FUN="testsweep")
Error in if (check.margin) { : argument is not interpretable as logical
In addition: Warning message:
In if (check.margin) { :
the condition has length > 1 and only the first element will be used
When I remove the n parameter from this test example, sweep works fine. This suggests to me that sweep cannot be used unless all the parameters provided to it either match the number of columns of t or have length 1. Please correct me if I am mistaken...
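For what it's worth, the error has a more mundane explanation: sweep's signature is sweep(x, MARGIN, STATS, FUN, check.margin, ...), so in the failing call the positional n lands in check.margin, which expects a logical. Extra arguments for FUN must be passed by name so they fall through to the dots. A minimal sketch, using the y and v from the question and a made-up three-argument function f:

y <- data.frame(v1 = c(1, 4, 0), v2 = c(6, 7, 2), v3 = c(43, 5, 32))
v <- 1:3

f <- function(x, s, n) x + s * n   # hypothetical FUN with an extra argument
sweep(y, 2, v, FUN = f, n = 2)     # n = 2 reaches f through sweep's ...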
You are asking to "sweep" v across the rows of y with the "+" function. Since v lines up with the columns, the margin is 2:
sweep(y, 2, v, FUN = "+")
  v1 v2 v3
1  2  8 46
2  5  9  8
3  1  4 35
If your actual problem is really no more complicated than this, you can take advantage of R's recycling rules. You need to transpose y first, then add, then transpose the result because R matrices are stored in column-major order.
t(t(y)+v)
v1 v2 v3
1 2 8 46
2 5 9 8
3 1 4 35
I don't think you need mapply here. Just use t() directly or you can use rep() to make the recycling match as you want:
> set.seed(1)
> mat <- matrix(sample(1:100, 9, TRUE), ncol = 3)
> vec <- 1:3
>
> mat
[,1] [,2] [,3]
[1,] 27 91 95
[2,] 38 21 67
[3,] 58 90 63
#Approach 1 using t()
> ans1 <- t(t(mat) + vec)
#Approach 2 using rep()
> ans2 <- mat + rep(vec, each = nrow(mat))
#Are they the same?
> identical(ans1, ans2)
[1] TRUE
#Hurray!
> ans1
[,1] [,2] [,3]
[1,] 28 93 98
[2,] 39 23 70
[3,] 59 92 66
How about using apply?
t(apply(y, 1, function(x) x + v))
[,1] [,2] [,3]
[1,] 2 8 46
[2,] 5 9 8
[3,] 1 4 35
apply returns each row's result as a column because it always assembles the function's outputs column by column, so the result needs to be transposed.
I would definitely take a look at mdply from the plyr package. It does exactly what you want to do:
library(plyr)
mdply(data.frame(mean = 1:5, sd = 1:5), rnorm, n = 2)
Each row of the data frame supplies one (mean, sd) pair of arguments to rnorm, with n = 2 held fixed.