Converting Unsymmetric Vector-List into Matrix - r

How can I convert a list object (with elements of different lengths) into a matrix object in an efficient way? The following example clarifies the goal.
Imagine you have a list object with this structure:
l <- list(c(1,2), c(5,7,3,11))
print(l)
# [[1]]
# [1] 1 2
# [[2]]
# [1] 5 7 3 11
The aim is to get a matrix or data.frame in form of:
[,1] [,2] [,3] [,4]
[1,] 1 2 NA NA
[2,] 5 7 3 11
It's very easy to tackle the problem with a for loop. Do you have any idea how to do this kind of transformation more easily? Thank you in advance!

You could also try
t(sapply(l, `length<-`, max(sapply(l, length))))
# [,1] [,2] [,3] [,4]
#[1,] 1 2 NA NA
#[2,] 5 7 3 11
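The one-liner above relies on the replacement function `length<-`, which is what R calls behind `length(x) <- n`. As a minimal sketch (the names x, padded and n are illustrative, not from the answer), this shows what it does to a single vector before applying it to the whole list:

```r
# `length<-` is the function form of `length(x) <- n`;
# extending a vector this way fills the new positions with NA.
x <- c(1, 2)
padded <- `length<-`(x, 4)
print(padded)

# Applied to every list element, each one is padded to the longest length.
l <- list(c(1, 2), c(5, 7, 3, 11))
n <- max(lengths(l))                 # length of the longest element
m <- t(sapply(l, `length<-`, n))     # sapply builds columns, t() turns them into rows
print(m)
```

Because every padded element has the same length, sapply can simplify the result to a matrix, which is why no explicit loop is needed.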

Here's one way to do it:
n <- max(sapply(l, length))
t(sapply(l, function(x) if(length(x) < n) c(x, rep(NA, n - length(x))) else x))
[,1] [,2] [,3] [,4]
[1,] 1 2 NA NA
[2,] 5 7 3 11
First we find the maximum vector length among the list elements and store it in n (which is 4 in this case).
Then we sapply over the list: if an element is shorter than n, we return it followed by NA repeated as often as the difference in length; otherwise we return it unchanged. Because every result now has length n, sapply returns a matrix with one column per list element. We use t() on that matrix to transpose it and get the desired result.

If you're open to using a package, you could also consider stri_list2matrix from the "stringi" package:
library(stringi)
l <- list(c(1,2), c(5,7,3,11))
stri_list2matrix(l, byrow = TRUE)
# [,1] [,2] [,3] [,4]
# [1,] "1" "2" NA NA
# [2,] "5" "7" "3" "11"
Regarding your question about doing this efficiently, @akrun's answer is already pretty efficient, but could be made more efficient by using vapply instead of sapply. The "stringi" approach is also pretty efficient (and has the benefit of not resorting to cryptic code like `length<-`), though note that it returns a character matrix, as the output above shows.
funDD <- function() {
n <- max(sapply(l, length))
t(sapply(l, function(x) if(length(x) < n) c(x, rep(NA, n - length(x))) else x))
}
funAK <- function() t(sapply(l, `length<-`, max(sapply(l, length))))
funAM <- function() {
x <- max(vapply(l, length, 1L))
t(vapply(l, `length<-`, numeric(x), x))
}
funStringi <- function() stri_list2matrix(l, byrow = TRUE)
## Make a big list to test on
set.seed(1)
l <- lapply(sample(3:10, 1000000, TRUE), function(x) sample(10, x, TRUE))
system.time(out1 <- funDD())
# user system elapsed
# 5.81 0.33 7.02
library(microbenchmark)
microbenchmark(funAK(), funAM(), funStringi(), times = 10)
# Unit: seconds
# expr min lq mean median uq max neval
# funAK() 2.350877 2.499963 2.974141 3.123008 3.200545 3.418648 10
# funAM() 1.154151 1.238235 1.337607 1.287610 1.494964 1.508884 10
# funStringi() 2.080901 2.168248 2.352030 2.344763 2.462959 2.716910 10

Related

How to calculate variance between observations, when observations are in matrix form each

I have a cluster of 250 observations. Each observation is a 4 by 9 matrix.
4 is the number of parameters observed and 9 is the number of days on which observations were collected.
I want to know the variance between the 250 observations, which are in matrix form. As far as I have studied, variance is calculated for one-dimensional variables.
Any suggestions for data in matrix form?
mat1 <- matrix(c(0:69),10,7)
mat2 <- matrix(c(3:72),10,7)
mat3 <- matrix(c(0:69),10,7)
...
var <- var(mat1,mat2, mat3,..)
For these three matrices, var() returns a 7 by 7 matrix with 9.166667 in every element. I do not know what R is doing or how it arrives at this.
I think this will reflect what you're hoping for.
First, I'll create a few matrices, very small:
set.seed(42)
mat1 <- matrix(sample(100,12),2,4)
mat2 <- matrix(sample(100,12),2,4)
mat3 <- matrix(sample(100,12),2,4)
From here, I think you want to get
var(c(mat1[1,1], mat2[1,1], mat3[1,1]))
# [1] 193
but for every set of cells in all matrices.
One way to do this is to abind all matrices into a 3D array and then use apply:
ary <- do.call(abind::abind, c(list(mat1, mat2, mat3), along = 3))
ary
# , , 1
# [,1] [,2] [,3] [,4]
# [1,] 49 25 18 47
# [2,] 65 74 100 24
# , , 2
# [,1] [,2] [,3] [,4]
# [1,] 26 41 27 5
# [2,] 3 89 36 84
# , , 3
# [,1] [,2] [,3] [,4]
# [1,] 24 43 22 8
# [2,] 30 15 58 36
apply(ary, 1:2, var)
# [,1] [,2] [,3] [,4]
# [1,] 193.0000 97.33333 20.33333 549
# [2,] 966.3333 1530.33333 1057.33333 1008
Where 193 is the variance of the [1,1] elements, 97.333 is the variance of the [1,2] elements, etc.
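If you would rather not depend on abind, the same 3D array can be built in base R. The following is a sketch under the assumption that all matrices share the same dimensions; the list name mats and its small deterministic matrices are made up for illustration:

```r
# Three small matrices with identical dimensions (illustrative data).
mats <- list(
  matrix(1:8, 2, 4),
  matrix(c(2, 4, 6, 8, 1, 3, 5, 7), 2, 4),
  matrix(8:1, 2, 4)
)

# unlist() concatenates the matrices column by column, so filling an array
# of dim (rows, cols, number of matrices) puts matrix k into slice [, , k].
ary <- array(unlist(mats), dim = c(dim(mats[[1]]), length(mats)))

# Variance across the third dimension for every (row, column) cell.
v <- apply(ary, 1:2, var)
print(v)
```

Each cell v[i, j] is then the variance of the [i, j] elements across all matrices, just as in the abind version.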
The arguments to var are:
> args(var)
function (x, y = NULL, na.rm = FALSE, use)
so mat1 is being passed to x and mat2 to y and mat3 to na.rm. Element i, j of the result is the covariance of x[, i] and y[, j].
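As a small check of that statement, the sketch below uses the first two matrices from the question and compares one element of var(x, y) with the covariance of the corresponding columns. The index values i and j are arbitrary choices for illustration:

```r
mat1 <- matrix(0:69, 10, 7)
mat2 <- matrix(3:72, 10, 7)

# With two matrix arguments, var() returns the cross-covariance of their columns.
v <- var(mat1, mat2)

i <- 2; j <- 5
print(all.equal(v[i, j], cov(mat1[, i], mat2[, j])))

# Every column here is a run of 10 consecutive integers, so any two columns
# differ only by a constant shift and each covariance equals var(0:9).
print(var(0:9))
print(all.equal(v[i, j], var(0:9)))
```

This is where the constant 9.166667 in the question comes from: it is the variance of ten consecutive integers, not a variance between the matrices.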
The call in the question does not do what was intended, and reading ?var would help. It is not clear what "I want to know the variance between 250 observations which are in matrix form" means. If it means that v[i, j] should be calculated as the variance of c(mat1[i, j], mat2[i, j], mat3[i, j]), then we can use one of several list comprehension packages or just nested sapply calls. They all rely on the fact that the following two expressions give the same result for fixed i and j, the first being more general.
var(sapply(L, `[`, i, j))
var(c(L[[1]][i, j], L[[2]][i,j], L[[3]][i,j]))
The syntax for the listcompr alternative seems particularly intuitive here.
L <- list(mat1, mat2, mat3)
nr <- nrow(L[[1]])
nc <- ncol(L[[1]])
library(listcompr)
v1 <- gen.matrix(var(sapply(L, `[`, i, j)), i = 1:nr, j = 1:nc)
# or
library(eList)
v2 <- Mat(for(j in 1:nc) for(i in 1:nr) var(sapply(L, `[`, i, j)))
# or (no packages):
v3 <- sapply(1:nc, \(j) sapply(1:nr, \(i) var(sapply(L, `[`, i, j))))
# checks
identical(v1, v2)
## [1] TRUE
identical(v1, v3)
## [1] TRUE
i <- 2; j <- 3
identical(v1[i, j], var(c(L[[1]][i, j], L[[2]][i,j], L[[3]][i,j])))
## [1] TRUE

How do you convert large list with vectors of different length to dataframe? [duplicate]

I have a large list of 30000+ elements. It contains vectors of different lengths, and I want to convert the list into a data frame where each vector becomes one row and its values are spread across multiple columns. Here is a mock example of the list:
lst <- list(a = c(1,2,4,5,6), c = c(7,8,9), c = c(10,11))
My desired output looks like this:
# [,1] [,2] [,3] [,4] [,5] [,6]
#a 1 2 3 4 5 6
#b 7 8 9 NA NA NA
#c 10 11 NA NA NA NA
You could do:
t(as.data.frame(lapply(lst, "length<-", max(lengths(lst)))))
# [,1] [,2] [,3] [,4] [,5]
#a 1 2 4 5 6
#c 7 8 9 NA NA
#c.1 10 11 NA NA NA
Or as @Andrew pointed out, you can do:
t(sapply(lst, "length<-", max(lengths(lst))))
# [,1] [,2] [,3] [,4] [,5]
#a 1 2 4 5 6
#c 7 8 9 NA NA
#c 10 11 NA NA NA
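Both variants above return a matrix. Since the question asks for a data frame, here is a minimal sketch of the extra step, assuming unique element names (a, b, c instead of the duplicated c in the mock example) so that the row names are unambiguous:

```r
# Illustrative list with unique names (an assumption; the question's list repeats "c").
lst <- list(a = c(1, 2, 4, 5, 6), b = c(7, 8, 9), c = c(10, 11))

# Pad every vector to the longest length and stack them as rows.
m <- t(sapply(lst, `length<-`, max(lengths(lst))))

# Convert the padded matrix into a data frame; columns get default names V1, V2, ...
df <- as.data.frame(m)
str(df)
```

Keeping the result as a matrix is usually cheaper; the conversion is only needed if later code expects data frame columns.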
Here is one base R option:
# Create a vector for number of times an NA needs to be padded
na_nums <- max(lengths(lst)) - lengths(lst)
# Transpose results after padding with NAs using mapply
t(mapply(c, lst, sapply(na_nums, rep, x = NA)))
[,1] [,2] [,3] [,4] [,5]
a 1 2 4 5 6
c 7 8 9 NA NA
c 10 11 NA NA NA
This was my first impulse.
max_len <- max(vapply(lst,
FUN = length,
FUN.VALUE = numeric(1)))
lst <- lapply(lst,
function(x, max_len) c(x, rep(NA, max_len - length(x))),
max_len)
# Form a matrix
do.call("rbind", lst)
It's a bit verbose, and some of the other solutions are rather elegant. Since you say your list is in excess of 30,000 elements, I was curious how these would perform on a list of length 30,000.
If this is something you need to do often, you may want to adopt andrew's approach.
lst <- list(a = c(1,2,4,5,6), c = c(7,8,9), c = c(10,11))
# build out a list of 30,000 elements.
lst <- lst[sample(1:3, 30000, replace = TRUE)]
library(microbenchmark)
microbenchmark(
benjamin = {
max_len <- max(vapply(lst,
FUN = length,
FUN.VALUE = numeric(1)))
lst <- lapply(lst,
function(x, max_len) c(x, rep(NA, max_len - length(x))),
max_len)
# Form a matrix
do.call("rbind", lst)
},
slava = {
Reduce(function(x,y){
n <- max(length(x), length(y))
length(x) <- n
length(y) <- n
rbind(x,y,deparse.level = 0)
},
lst)
},
andrew = {
na_nums <- max(lengths(lst)) - lengths(lst)
# Transpose results after padding with NAs using mapply
t(mapply(c, lst, sapply(na_nums, rep, x = NA)))
},
matt = {
t(as.data.frame(lapply(lst, "length<-", max(lengths(lst)))))
}
)
Unit: milliseconds
expr min lq mean median uq max neval
benjamin 77.08337 91.42793 117.9376 106.97656 122.53898 191.6612 5
slava 32383.10840 32962.57589 32976.6662 33071.40314 33180.70634 33285.5372 5
andrew 60.91803 66.74401 87.1645 71.92043 77.78805 158.4520 5
matt 1685.09158 1702.19796 1759.2741 1737.01949 1760.86237 1911.1993 5
The trick is to make vectors of equal length. Also, it seems like you want to have a matrix as output.
Reduce(function(x,y){
n <- max(length(x), length(y))
length(x) <- n
length(y) <- n
rbind(x,y,deparse.level = 0)
},
list(a = c(1,2,4,5,6), c = c(7,8,9), c = c(10,11)))
Output
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 2 4 5 6
# [2,] 7 8 9 NA NA
# [3,] 10 11 NA NA NA
You can reset the row names at this point.
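Resetting the row names could look like the sketch below. It assumes the padded matrix has one row per list element in the original order, and it uses unique illustrative names (a, b, c) rather than the duplicated names from the question:

```r
# Illustrative list with unique names (an assumption for this sketch).
lst <- list(a = c(1, 2, 4, 5, 6), b = c(7, 8, 9), c = c(10, 11))

# A padded matrix without row names, comparable to the output shown above.
m <- unname(t(sapply(lst, `length<-`, max(lengths(lst)))))

# Put the list names back as row names.
rownames(m) <- names(lst)
print(m)
```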
UPDATE
Timings for those who are interested:
lst <- list(a = c(1,2,4,5,6), c = c(7,8,9), c = c(10,11))
convert <-function(lst){
Reduce(function(x,y){
n <- max(length(x), length(y))
length(x) <- n
length(y) <- n
rbind(x,y,deparse.level = 0)
},
lst)
}
convert2 <- function(lst){
t(sapply(lst, "length<-", max(lengths(lst))))
}
convert3 <- function(lst){
t(as.data.frame(lapply(lst, "length<-", max(lengths(lst)))))
}
microbenchmark::microbenchmark(convert(lst),
convert2(lst),
convert3(lst))
#Unit: microseconds
# expr min lq mean median uq max neval
# convert(lst) 41.962 50.0725 106.47314 62.2375 68.408 4392.895 100
# convert2(lst) 28.209 33.6755 69.93855 40.7280 45.136 2298.002 100
# convert3(lst) 292.673 306.6005 381.59504 319.1180 333.399 2887.929 100

Imputation with column medians in R

If I have a vector, for example
vec <- c(3,4,5,NA)
I can replace the NA with the median value of the other values in the vector with the following code:
vec[which(is.na(vec))] <- median(vec, na.rm = T)
However, if I have a matrix containing NAs, applying this same code across all columns of the matrix doesn't give me back a matrix, just returning the medians of each matrix column.
mat <- matrix(c(1,NA,3,5,6,7,NA,3,4,NA,2,8), ncol = 3)
apply(mat, 2, function(x) x[which(is.na(x))] <- median(x, na.rm=T) )
#[1] 3 6 4
How can I get the matrix back with NAs replaced by column medians? This question is similar: "Replace NA values by row means", but I can't adapt any of the solutions to my case.
There is a convenient function (na.aggregate) in zoo to replace the NA elements with the specified FUN.
library(zoo)
apply(mat, 2, FUN = function(x) na.aggregate(x, FUN = median))
# [,1] [,2] [,3]
#[1,] 1 6 4
#[2,] 3 7 4
#[3,] 3 6 2
#[4,] 5 3 8
Or as @G.Grothendieck commented, na.aggregate can be directly applied on the matrix
na.aggregate(mat, FUN = median)
Adding return(x) as last line of the function within apply will solve it.
> apply(mat, 2, function(x){
x[which(is.na(x))] <- median(x, na.rm=T)
return(x)
})
[,1] [,2] [,3]
[1,] 1 6 4
[2,] 3 7 4
[3,] 3 6 2
[4,] 5 3 8

compute only upper triangle of matrix

I have vector:
v1 = c(1,2,3)
From this vector I want to create a matrix where the element at position i,j is the sum of the vector elements at positions i and j:
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 3 4 5
[3,] 4 5 6
Questions:
1. i,j and j,i are the same, so there is no reason to compute them twice. How can I avoid this for better performance?
2. How can I also create a variant that does not compute the elements where i == j and simply returns NA instead? I'm not asking for the diag(m) <- NA command; I'm asking how to avoid computing those elements.
PS: This is reduced version of my problem
There is an approach that is much faster than a straightforward calculation with 2 nested loops. It is not optimized in the way you describe in question 1, but it is pretty fast because it is vectorized. Maybe it will be enough for your purpose.
The vectorized (matrix-based) approach:
f1 <- function(x){
n <- length(x)
m <- matrix(rep(x,n),n)
m + t(m)
}
> f1(1:3)
[,1] [,2] [,3]
[1,] 2 3 4
[2,] 3 4 5
[3,] 4 5 6
We can also create a function for the straightforward approach to benchmark against. This function does even less than needed: it calculates only the upper triangle, but we will see that it's much slower.
f2 <- function(x){
n <- length(x)
m <- matrix(rep(NA,n^2),n)
for(i in 1:(n-1)){
for(j in (i+1):n) m[i,j] <- x[[i]] + x[[j]]
}
m
}
> f2(1:3)
[,1] [,2] [,3]
[1,] NA 3 4
[2,] NA NA 5
[3,] NA NA NA
Benchmark:
library(microbenchmark)
> microbenchmark(f1(1:100), f2(1:100))
Unit: microseconds
expr min lq mean median uq max neval
f1(1:100) 124.775 138.6175 181.6401 187.731 196.454 294.301 100
f2(1:100) 10227.337 10465.1285 11000.1493 10616.830 10907.148 15826.259 100
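It is also possible to compute only the upper triangle without explicit loops by indexing with the positions returned by upper.tri. This is a sketch, not one of the benchmarked functions; the name upper_sums is made up for illustration:

```r
# Compute x[i] + x[j] only for i < j; diagonal and lower triangle stay NA
# and are never calculated.
upper_sums <- function(x) {
  n <- length(x)
  m <- matrix(NA_real_, n, n)
  idx <- which(upper.tri(m), arr.ind = TRUE)   # all (row, col) pairs with row < col
  m[idx] <- x[idx[, "row"]] + x[idx[, "col"]]
  m
}

u <- upper_sums(c(1, 2, 3))
print(u)

# For comparison, the full symmetric matrix in one call.
full <- outer(c(1, 2, 3), c(1, 2, 3), "+")
print(full)
```

Whether this beats the fully vectorized version depends on the vector length, so it would need its own benchmark before relying on it.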

Multiplying Combinations of a list of lists in R

Given a list of two lists, I am trying to obtain, without using for loops, a list of all element-wise products of the first list with the second. For example:
> a <- list(c(1,2), c(2,3), c(4,5))
> b <- list(c(1,3), c(3,4), c(6,2))
> c <- list(a, b)
The function should return a list with 9 entries, each of size two. For example,
> answer
[[1]]
[1] 1 6
[[2]]
[1] 3 8
[[3]]
[1] 6 4
[[4]]
[1] 2 9
[[5]]
[1] 6 12
etc...
Any suggestions would be much appreciated!
A fast (but memory-intensive) way would be to use the mechanism of mapply in combination with argument recycling, something like this:
mapply(`*`,a,rep(b,each=length(a)))
Gives :
> mapply(`*`,a,rep(b,each=length(a)))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 1 2 4 3 6 12 6 12 24
[2,] 6 9 15 8 12 20 4 6 10
Or replace a with c[[1]] and b with c[[2]] to obtain the same. To get a list, set the argument SIMPLIFY = FALSE.
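A short sketch of the list form follows. The second call swaps which argument is repeated, which should give the element order shown in the question; the names res_list and res_q are illustrative:

```r
a <- list(c(1, 2), c(2, 3), c(4, 5))
b <- list(c(1, 3), c(3, 4), c(6, 2))

# Same pairing as the matrix above, but kept as a list of 9 vectors:
# a is recycled, each element of b is repeated length(a) times.
res_list <- mapply(`*`, a, rep(b, each = length(a)), SIMPLIFY = FALSE)

# Repeating the elements of a instead pairs a[[1]] with every element of b first,
# then a[[2]], and so on, matching the order in the question.
res_q <- mapply(`*`, rep(a, each = length(b)), b, SIMPLIFY = FALSE)

print(res_q[1:5])
```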
I have no idea whether this is fast or memory-intensive, just that it works. Joris Meys's answer is more eloquent:
x <- expand.grid(1:length(a), 1:length(b))
x <- x[order(x$Var1), ] #gives the order you asked for
FUN <- function(i) diag(outer(a[[x[i, 1]]], b[[x[i, 2]]], "*"))
sapply(1:nrow(x), FUN) #I like this output
lapply(1:nrow(x), FUN) #This one matches what you asked for
EDIT: Now that Brian introduced benchmarking (which I love) I have to respond. I actually have a faster answer using what I call expand.grid2, a lighter-weight version of expand.grid that I borrowed from elsewhere. I was going to post it earlier, but when I saw how fast Joris's answer is, and that it is short and sweet as well, I figured why bother. But now that Diggs has dug in, I'll post expand.grid2 here for educational purposes.
expand.grid2 <-function(seq1,seq2) {
cbind(Var1 = rep.int(seq1, length(seq2)),
Var2 = rep.int(seq2, rep.int(length(seq1),length(seq2))))
}
x <- expand.grid2(1:length(a), 1:length(b))
x <- x[order(x[,'Var1']), ] #gives the order you asked for
FUN <- function(i) diag(outer(a[[x[i, 1]]], b[[x[i, 2]]], "*"))
lapply(1:nrow(x), FUN)
Here are the results (same labeling as Brian's, except TylerEG2 is using expand.grid2):
Unit: microseconds
expr min lq median uq max
1 DiggsL(a, b) 5102.296 5307.816 5471.578 5887.516 70965.58
2 DiggsM(a, b) 384.912 428.769 443.466 461.428 36213.89
3 Joris(a, b) 91.446 105.210 123.172 130.171 16833.47
4 TylerEG2(a, b) 392.377 425.503 438.100 453.263 32208.94
5 TylerL(a, b) 1752.398 1808.852 1847.577 1975.880 49214.10
6 TylerM(a, b) 1827.515 1888.867 1925.959 2090.421 75766.01
7 Wojciech(a, b) 1719.740 1771.760 1807.686 1924.325 81666.12
And if I take the ordering step out I can squeak out even more but it still isn't close to Joris's answer.
Pulling ideas from the other answers together, I'll throw another one-liner in for fun:
do.call(mapply, c(FUN=`*`, as.list(expand.grid(b, a))))
which gives
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 1 3 6 2 6 12 4 12 24
[2,] 6 8 4 9 12 6 15 20 10
If you really need it in the format you gave, then you can use the plyr library to transform it into that:
library("plyr")
as.list(unname(alply(do.call(mapply, c(FUN=`*`, as.list(expand.grid(b, a)))), 2)))
which gives
[[1]]
[1] 1 6
[[2]]
[1] 3 8
[[3]]
[1] 6 4
[[4]]
[1] 2 9
[[5]]
[1] 6 12
[[6]]
[1] 12 6
[[7]]
[1] 4 15
[[8]]
[1] 12 20
[[9]]
[1] 24 10
Just for fun, benchmarking:
Joris <- function(a, b) {
mapply(`*`,a,rep(b,each=length(a)))
}
TylerM <- function(a, b) {
x <- expand.grid(1:length(a), 1:length(b))
x <- x[order(x$Var1), ] #gives the order you asked for
FUN <- function(i) diag(outer(a[[x[i, 1]]], b[[x[i, 2]]], "*"))
sapply(1:nrow(x), FUN)
}
TylerL <- function(a, b) {
x <- expand.grid(1:length(a), 1:length(b))
x <- x[order(x$Var1), ] #gives the order you asked for
FUN <- function(i) diag(outer(a[[x[i, 1]]], b[[x[i, 2]]], "*"))
lapply(1:nrow(x), FUN)
}
Wojciech <- function(a, b) {
# Matrix with indices for elements to multiply
G <- expand.grid(1:3,1:3)
# Conversion of G to list
L <- lapply(1:nrow(G),function(x,d=G) d[x,])
lapply(L,function(i,x=a,y=b) x[[i[[2]]]]*y[[i[[1]]]])
}
DiggsM <- function(a, b) {
do.call(mapply, c(FUN=`*`, as.list(expand.grid(b, a))))
}
DiggsL <- function(a, b) {
as.list(unname(alply(t(do.call(mapply, c(FUN=`*`, as.list(expand.grid(b, a))))), 1)))
}
and the benchmarks
> library("rbenchmark")
> benchmark(Joris(b,a),
+ TylerM(a,b),
+ TylerL(a,b),
+ Wojciech(a,b),
+ DiggsM(a,b),
+ DiggsL(a,b),
+ order = "relative",
+ replications = 1000,
+ columns = c("test", "elapsed", "relative"))
test elapsed relative
1 Joris(b, a) 0.08 1.000
5 DiggsM(a, b) 0.26 3.250
4 Wojciech(a, b) 1.34 16.750
3 TylerL(a, b) 1.36 17.000
2 TylerM(a, b) 1.40 17.500
6 DiggsL(a, b) 3.49 43.625
and to show they are equivalent:
> identical(Joris(b,a), TylerM(a,b))
[1] TRUE
> identical(Joris(b,a), DiggsM(a,b))
[1] TRUE
> identical(TylerL(a,b), Wojciech(a,b))
[1] TRUE
> identical(TylerL(a,b), DiggsL(a,b))
[1] TRUE
# Your data
a <- list(c(1,2), c(2,3), c(4,5))
b <- list(c(1,3), c(3,4), c(6,2))
# Matrix with indices for elements to multiply
G <- expand.grid(1:3,1:3)
# Conversion of G to list
L <- lapply(1:nrow(G),function(x,d=G) d[x,])
lapply(L,function(i,x=a,y=b) x[[i[[2]]]]*y[[i[[1]]]])

Resources