I have the following two matrices:
> dat <- cbind(c(1,1,2,3),c(55,23,65,67))
> dat
     [,1] [,2]
[1,]    1   55
[2,]    1   23
[3,]    2   65
[4,]    3   67
> cond <- cbind(c(1,2,3),c(0.9,1,1.1))
> cond
     [,1] [,2]
[1,]    1  0.9
[2,]    2  1.0
[3,]    3  1.1
Now, I would like to divide column 2 of dat with column 2 of cond conditional on the rows having the same value in column 1. That is:
55/0.9
23/0.9
65/1
67/1.1
How do I do that easily in R? I am also interested in solutions for data.frames.
Thanks!
You can do this with match, assuming the values in cond's first column are unique:
dat[, 2] / cond[match(dat[, 1], cond[, 1]), 2]
# [1] 61.11111 25.55556 65.00000 60.90909
This will be faster than merge. match finds, for each value in dat's first column, the index of the matching value in cond's first column; those indices can then be used to retrieve the corresponding values from cond. This also works with data frames.
To understand what match is doing, try looking at the result of:
match(dat[, 1], cond[, 1])
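With the example data, match returns the cond row index for each row of dat. A quick sketch to illustrate:

```r
dat  <- cbind(c(1, 1, 2, 3), c(55, 23, 65, 67))
cond <- cbind(c(1, 2, 3), c(0.9, 1, 1.1))

# For each key in dat's first column, find its row index in cond
idx <- match(dat[, 1], cond[, 1])
idx
# [1] 1 1 2 3

# Indexing cond with idx lines the divisors up with dat's rows
dat[, 2] / cond[idx, 2]
# [1] 61.11111 25.55556 65.00000 60.90909
```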
As @Anand Mahto suggested, merge the two matrices; the calculation then becomes simple:
df <- merge(dat, cond, by=1)
df[,2]/df[,3]
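One caveat worth noting (a sketch using the question's data, not part of the original answer): merge sorts the result by the key column by default, so the row order of the quotients may differ from dat's original order, whereas the match-based answer preserves it.

```r
dat  <- cbind(c(1, 1, 2, 3), c(55, 23, 65, 67))
cond <- cbind(c(1, 2, 3), c(0.9, 1, 1.1))

# merge coerces both matrices to data frames and joins on column 1;
# sort = TRUE (the default) orders the result by the key
df <- merge(dat, cond, by = 1)
df[, 2] / df[, 3]
```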
FWIW,
Rgames> cond<-cbind(1:100,runif(100))
Rgames> dat<-cbind(sample(1:100,1e5,rep=TRUE),runif(1e5))
Rgames> library(microbenchmark)
Rgames> microbenchmark(brodie(dat,cond),shadow(dat,cond),times=10)
Unit: milliseconds
              expr        min         lq     median        uq       max neval
 brodie(dat, cond)   4.981001   5.411622   6.082569  21.57764  72.83944    10
 shadow(dat, cond) 289.586938 304.098892 309.919966 353.00062 372.19423    10
I want to split a large matrix, mt, into a list of sub-matrices, res. The number of rows for each sub-matrix is specified by a vector, lens.
For example,
> mt=matrix(c(1:20),ncol=2)
> mt
      [,1] [,2]
 [1,]    1   11
 [2,]    2   12
 [3,]    3   13
 [4,]    4   14
 [5,]    5   15
 [6,]    6   16
 [7,]    7   17
 [8,]    8   18
 [9,]    9   19
[10,]   10   20
lens=c(2,3,5)
What I want is a function some_function, that can offer the following result,
> res=some_function(mt,lens)
> res
[[1]]
[,1] [,2]
[1,] 1 11
[2,] 2 12
[[2]]
[,1] [,2]
[1,] 3 13
[2,] 4 14
[3,] 5 15
[[3]]
[,1] [,2]
[1,] 6 16
[2,] 7 17
[3,] 8 18
[4,] 9 19
[5,] 10 20
Speed is a big concern. The faster, the better!
Many thanks!
A function that builds an index from the length of each group and splits the matrix:
mt <- matrix(c(1:20), ncol=2)
# Two arguments: m - matrix, len - length of each group
m_split <- function(m, len){
index <- 1:sum(len)
group <- rep(1:length(len), times = len)
index_list <- split(index, group)
mt_list <- lapply(index_list, function(vec) m[vec, ])
return(mt_list)
}
m_split(mt, c(2, 3, 5))
$`1`
[,1] [,2]
[1,] 1 11
[2,] 2 12
$`2`
[,1] [,2]
[1,] 3 13
[2,] 4 14
[3,] 5 15
$`3`
[,1] [,2]
[1,] 6 16
[2,] 7 17
[3,] 8 18
[4,] 9 19
[5,] 10 20
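One edge case worth guarding against (my addition, not part of the original answer): when a group length is 1, `m[vec, ]` drops the result to a plain vector. Adding `drop = FALSE` keeps every piece a matrix:

```r
mt <- matrix(1:20, ncol = 2)

m_split2 <- function(m, len) {
  group <- rep(seq_along(len), times = len)
  index_list <- split(seq_len(sum(len)), group)
  # drop = FALSE keeps single-row groups as 1-row matrices
  lapply(index_list, function(vec) m[vec, , drop = FALSE])
}

res <- m_split2(mt, c(1, 4, 5))
dim(res[[1]])
# [1] 1 2
```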
Update
I used the following code to compare the performance of each method in this post.
library(microbenchmark)
library(data.table)
# Test case from @missuse
mt <- matrix(c(1:20000000),ncol=10)
lens <- c(20000,15000,(nrow(mt)-20000-15000))
# Functions from @Damiano Fantini
split.df <- function(mt, lens) {
fac <- do.call(c, lapply(1:length(lens), (function(i){ rep(i, lens[i])})))
split(as.data.frame(mt), f = fac)
}
split.mat <- function(mt, lens) {
fac <- do.call(c, lapply(1:length(lens), (function(i){ rep(i, lens[i])})))
lapply(unique(fac), (function(i) {mt[fac==i,]}))
}
# Benchmarking
microbenchmark(m1 = {m_split(mt, lens)}, # @ycw's method
  m2 = {pam = rep(1:length(lens), times = lens)
        split(data.table(mt), pam)}, # @missuse's data.table method
  m3 = {split.df(mt, lens)}, # @Damiano Fantini's data frame method
  m4 = {split.mat(mt, lens)}) # @Damiano Fantini's matrix method
Unit: milliseconds
expr min lq mean median uq max neval
m1 167.6896 209.7746 251.0932 230.5920 274.9347 555.8839 100
m2 402.3415 497.2397 554.1094 547.9603 599.7632 787.4112 100
m3 552.8548 657.6245 719.2548 711.4123 769.6098 989.6779 100
m4 166.6581 203.6799 249.2965 235.5856 275.4790 547.4927 100
As we can see, m1 and m4 are the fastest, with almost no difference between them. This means there is no need to convert the matrix to a data frame or a data.table, especially if the OP will keep working on the matrix: working directly on it (m1 and m4) is sufficient.
If you are OK working with data.frames instead of matrices, you might build a grouping factor/vector according to lens and then use split(). Alternatively, use this grouping vector to subset your matrix and return a list. In this example, I wrapped the two solutions into two functions:
# your data
mt=matrix(c(1:20),ncol=2)
lens=c(2,3,5)
# based on split
split.df <- function(mt, lens) {
fac <- do.call(c, lapply(1:length(lens), (function(i){ rep(i, lens[i])})))
split(as.data.frame(mt), f = fac)
}
split.df(mt, lens)
# based on subsetting
split.mat <- function(mt, lens) {
fac <- do.call(c, lapply(1:length(lens), (function(i){ rep(i, lens[i])})))
lapply(unique(fac), (function(i) {mt[fac==i,]}))
}
split.mat(mt, lens)
This second option is about 10 times faster than the first according to microbenchmark:
library(microbenchmark)
microbenchmark({split.df(mt, lens)}, times = 1000)
# median = 323.743 microseconds
microbenchmark({split.mat(mt, lens)}, times = 1000)
# median = 31.7645 microseconds
One approach is split; however, it operates on vectors and data.frames, so you need to convert the matrix first. data.table should be efficient:
mt=matrix(c(1:20),ncol=2)
lens=c(2,3,5)
pam = rep(1:length(lens), times = lens)
library(data.table)
mt_split <- split(data.table(mt), pam)
mt_split
#output
$`1`
V1 V2
1: 1 11
2: 2 12
$`2`
V1 V2
1: 3 13
2: 4 14
3: 5 15
$`3`
V1 V2
1: 6 16
2: 7 17
3: 8 18
4: 9 19
5: 10 20
Checking speed
mt=matrix(c(1:20000000),ncol=10)
lens=c(20000,15000,(nrow(mt)-20000-15000))
pam = rep(1:length(lens), times = lens)
system.time(split(data.table(mt), pam))
#output
user system elapsed
0.75 0.20 0.96
I have vector:
v1 = c(1,2,3)
From this vector I want to create a matrix where the element at position i,j is the sum of the vector's elements at positions i and j:
     [,1] [,2] [,3]
[1,]    2    3    4
[2,]    3    4    5
[3,]    4    5    6
Questions:
1. i,j and j,i are the same, so there is no reason to compute each sum twice. How can I exploit this for better performance?
2. How can I create a variant that returns NA when i == j instead of computing those elements? I'm not asking for the diag(m) <- NA command; I'm asking how to avoid computing those elements in the first place.
PS: This is reduced version of my problem
There is an approach that is much faster than a straightforward calculation with two nested loops. It is not optimized in the sense you described in question 1, but it is pretty fast because it is vectorized. It may be enough for your purpose.
Vectorized (or even matrix) approach itself:
f1 <- function(x){
n <- length(x)
m <- matrix(rep(x,n),n)
m + t(m)
}
> f1(1:3)
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 3 4 5
[3,] 4 5 6
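For what it's worth, the same vectorized result can be written even more compactly with base R's outer(), which applies a function to every pair of elements:

```r
v1 <- c(1, 2, 3)

# outer() computes "+" over all (i, j) pairs in one vectorized call
m <- outer(v1, v1, `+`)
m
#      [,1] [,2] [,3]
# [1,]    2    3    4
# [2,]    3    4    5
# [3,]    4    5    6
```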
We can also write a straightforward double-loop function to benchmark against. It actually does less work than required (it fills only the upper triangle), yet we will see that it is much slower.
f2 <- function(x){
n <- length(x)
m <- matrix(rep(NA,n^2),n)
for(i in 1:(n-1)){
for(j in (i+1):n) m[i,j] <- x[[i]] + x[[j]]
}
m
}
> f2(1:3)
[,1] [,2] [,3]
[1,] NA 3 4
[2,] NA NA 5
[3,] NA NA NA
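To actually address question 1 (compute each pair only once) while staying vectorized, one option (a sketch, not from the original answers) is to fill only the upper triangle and mirror it; the diagonal stays NA and is never computed, which covers question 2 as well:

```r
f3 <- function(x) {
  n <- length(x)
  m <- matrix(NA_real_, n, n)
  # Each (i, j) pair with i < j appears exactly once in the upper triangle
  ut <- which(upper.tri(m), arr.ind = TRUE)
  m[ut] <- x[ut[, 1]] + x[ut[, 2]]        # only n*(n-1)/2 sums computed
  m[lower.tri(m)] <- t(m)[lower.tri(m)]   # mirror into the lower triangle
  m                                       # diagonal remains NA, never computed
}

f3(1:3)
#      [,1] [,2] [,3]
# [1,]   NA    3    4
# [2,]    3   NA    5
# [3,]    4    5   NA
```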
Benchmark:
library(microbenchmark)
> microbenchmark(f1(1:100), f2(1:100))
Unit: microseconds
expr min lq mean median uq max neval
f1(1:100) 124.775 138.6175 181.6401 187.731 196.454 294.301 100
f2(1:100) 10227.337 10465.1285 11000.1493 10616.830 10907.148 15826.259 100
Consider I have following matrix
M <- matrix(1:9, 3, 3)
M
# [,1] [,2] [,3]
# [1,] 1 4 7
# [2,] 2 5 8
# [3,] 3 6 9
I just want to find the last element i.e M[3, 3]
Since the matrix's row and column counts are dynamic, we can't hardcode M[3, 3].
How can I get the value of last element?
Currently I've done using the below code
M[nrow(M), ncol(M)]
# [1] 9
Is there any better way to do it?
A matrix in R is just a vector with a dim attribute, so you can subset it as one:
M[length(M)]
## [1] 9
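A quick sketch of why this works: the dim attribute is the only thing distinguishing the matrix from its underlying vector, and single-index subsetting walks that vector in column-major order.

```r
M <- matrix(1:9, 3, 3)

attributes(M)    # just $dim: 3 3
M[5]             # 5th element in column-major order (column 2, row 2)
# [1] 5
M[length(M)]     # so the last element is simply M[length(M)]
# [1] 9
```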
Though (as mentioned by @James) your solution is more general if you want to keep the matrix structure, since you can add drop = FALSE:
M[nrow(M), ncol(M), drop = FALSE]
# [,1]
# [1,] 9
My solution can also be modified in a similar manner, using the dim<- replacement function:
`dim<-`(M[length(M)], c(1,1))
# [,1]
# [1,] 9
Some benchmarks (contributed by @zx8754):
M <- matrix(runif(1000000),nrow=1000)
microbenchmark(
nrow_ncol={
M[nrow(M),ncol(M)]
},
dim12={
M[dim(M)[1],dim(M)[2]]
},
length1={
M[length(M)]
},
tail1={
tail(c(M),1)
},
times = 1000
)
# Unit: nanoseconds
# expr min lq mean median uq max neval cld
# nrow_ncol 605 1209 3799.908 3623.0 6038 27167 1000 a
# dim12 302 605 2333.241 1811.0 3623 19922 1000 a
# length1 0 303 2269.564 1510.5 3925 14792 1000 a
# tail1       3103005 3320034 4022028.561 3377234.0 3467487 42777080 1000   b
I would rather do:
tail(c(M),1)
# [1] 9
One way to do this, and to avoid unnecessary repetition of the object name (or silly typos), is to use pipes. Like this:
require(magrittr)
M %>% .[nrow(.), ncol(.)]
##[1] 9
M %>% `[`(nrow(.), ncol(.))
##[1] 9
M %>% extract(nrow(.), ncol(.))
##[1] 9
The approaches are equivalent, so you can choose whichever feels more intuitive to you.
How can I efficiently convert a list object (whose elements have different lengths) into a matrix? The following example clarifies the goal.
Imagine you have a list with this structure:
l <- list(c(1,2), c(5,7,3,11))
print(l)
# [[1]]
# [1] 1 2
# [[2]]
# [1] 5 7 3 11
The aim is to get a matrix or data.frame in form of:
     [,1] [,2] [,3] [,4]
[1,]    1    2   NA   NA
[2,]    5    7    3   11
It's easy to tackle the problem with a for-loop. Do you have any idea how to make this kind of transformation more simply? Thank you in advance!
You could also try
t(sapply(l, `length<-`, max(sapply(l, length))))
# [,1] [,2] [,3] [,4]
#[1,] 1 2 NA NA
#[2,] 5 7 3 11
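The `length<-` trick works because assigning a longer length pads a vector's tail with NA; sapply then binds the padded vectors column-wise, and t() flips them into rows. A quick sketch:

```r
l <- list(c(1, 2), c(5, 7, 3, 11))
n <- max(sapply(l, length))   # longest element: 4

# Extending a vector's length pads it with NA
`length<-`(c(1, 2), n)
# [1]  1  2 NA NA

t(sapply(l, `length<-`, n))
#      [,1] [,2] [,3] [,4]
# [1,]    1    2   NA   NA
# [2,]    5    7    3   11
```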
Here's one way to do it:
n <- max(sapply(l, length))
t(sapply(l, function(x) if(length(x) < n) c(x, rep(NA, n - length(x))) else x))
[,1] [,2] [,3] [,4]
[1,] 1 2 NA NA
[2,] 5 7 3 11
First we find out the maximum vector length per list element and store it in n (which is 4 in this case).
Then, we sapply over the list and check if the length of the list element is equal to n and if it is, return it, if it's shorter than n, return the list element + NA repeated as often as the difference in length. This returns a matrix. We use t() on that matrix to transpose it and get the desired result.
If you're open to using a package, you could also consider stri_list2matrix from the "stringi" package:
library(stringi)
l <- list(c(1,2), c(5,7,3,11))
stri_list2matrix(l, byrow = TRUE)
# [,1] [,2] [,3] [,4]
# [1,] "1" "2" NA NA
# [2,] "5" "7" "3" "11"
Regarding your question about doing this efficiently, @akrun's answer is already pretty efficient, but it can be made more efficient by using vapply instead of sapply. The "stringi" approach is also pretty efficient (and has the benefit of not resorting to cryptic code like length<-), though note that it returns a character matrix.
funDD <- function() {
n <- max(sapply(l, length))
t(sapply(l, function(x) if(length(x) < n) c(x, rep(NA, n - length(x))) else x))
}
funAK <- function() t(sapply(l, `length<-`, max(sapply(l, length))))
funAM <- function() {
x <- max(vapply(l, length, 1L))
t(vapply(l, `length<-`, numeric(x), x))
}
funStringi <- function() stri_list2matrix(l, byrow = TRUE)
## Make a big list to test on
set.seed(1)
l <- lapply(sample(3:10, 1000000, TRUE), function(x) sample(10, x, TRUE))
system.time(out1 <- funDD())
# user system elapsed
# 5.81 0.33 7.02
library(microbenchmark)
microbenchmark(funAK(), funAM(), funStringi(), times = 10)
# Unit: seconds
# expr min lq mean median uq max neval
# funAK() 2.350877 2.499963 2.974141 3.123008 3.200545 3.418648 10
# funAM() 1.154151 1.238235 1.337607 1.287610 1.494964 1.508884 10
# funStringi() 2.080901 2.168248 2.352030 2.344763 2.462959 2.716910 10
I have a list with three matrices:
a<-matrix(runif(100))
b<-matrix(runif(100))
c<-matrix(runif(100))
mylist<-list(a,b,c)
I would like to obtain the element-wise mean across the three matrices.
I tried aaply(laply(mylist, as.matrix), c(1, 1), mean), but this returns the mean of each matrix instead of taking the mean of each element the way rowMeans() takes it per row.
Maybe what you want is:
> set.seed(1)
> a<-matrix(runif(4))
> b<-matrix(runif(4))
> c<-matrix(runif(4))
> mylist<-list(a,b,c) # a list of 3 matrices
>
> apply(simplify2array(mylist), c(1,2), mean)
[,1]
[1,] 0.3654349
[2,] 0.4441000
[3,] 0.5745011
[4,] 0.5818541
The vector c(1,2) for MARGIN in the apply call indicates that the function mean should be applied to rows and columns (both at once), see ?apply for further details.
Another alternative is using Reduce function
> Reduce("+", mylist)/ length(mylist)
[,1]
[1,] 0.3654349
[2,] 0.4441000
[3,] 0.5745011
[4,] 0.5818541
The simplify2array option is really slow because it calls the mean function nrow*ncol times:
Unit: milliseconds
expr min lq mean median uq max neval
reduce 7.320327 8.051267 11.23352 12.17859 13.59846 13.72176 10
simplify2array 4233.090223 4674.827077 4802.74033 4808.00417 5010.75771 5228.05362 10
via_vector 27.720372 42.757517 51.95250 59.47917 60.11251 61.83605 10
for_loop 10.405315 12.919731 13.93157 14.46218 15.82175 15.89977 10
l=lapply(1:3,function(i)matrix(i*(1:1e6),10))
microbenchmark(times=10,
Reduce={Reduce(`+`,l)/length(l)},
simplify2array={apply(simplify2array(l),c(1,2),mean)},
via_vector={matrix(rowMeans(sapply(l,as.numeric)),nrow(l[[1]]))},
for_loop={o=l[[1]];for(i in 2:length(l))o=o+l[[i]];o/length(l)}
)
Your question is not clear.
For the mean of all elements of each matrix:
sapply(mylist, mean)
For the mean of every row of each matrix:
sapply(mylist, rowMeans)
For the mean of every column of each matrix:
sapply(mylist, colMeans)
Note that sapply will automatically simplify the results to a vector or matrix, if possible. In the first case, the result will be a vector, but in the second and third, it may be a list or matrix.
Example:
a <- matrix(1:6,2,3)
b <- matrix(7:10,2,2)
c <- matrix(11:16,3,2)
mylist <- list(a,b,c)
> mylist
[[1]]
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
[[2]]
[,1] [,2]
[1,] 7 9
[2,] 8 10
[[3]]
[,1] [,2]
[1,] 11 14
[2,] 12 15
[3,] 13 16
Results:
> sapply(mylist, mean)
[1] 3.5 8.5 13.5
> sapply(mylist, rowMeans)
[[1]]
[1] 3 4
[[2]]
[1] 8 9
[[3]]
[1] 12.5 13.5 14.5
> sapply(mylist, colMeans)
[[1]]
[1] 1.5 3.5 5.5
[[2]]
[1] 7.5 9.5
[[3]]
[1] 12 15