3D array -> apply -> 3D array

It seems apply will not re-assemble 3D arrays when operating on just one margin. Consider:
arr <- array(
  runif(2 * 4 * 3),
  dim = c(2, 4, 3),
  dimnames = list(a = paste0("a", 1:2), b = paste0("b", 1:4), c = paste0("c", 1:3))
)
# , , c = c1
#
# b
# a b1 b2 b3 b4
# a1 0.7321399 0.8851802 0.2469866 0.9307044
# a2 0.5896138 0.6183046 0.7732842 0.6652637
#
# , , c = c2
#
# b
# a b1 b2 b3 b4
# a1 0.5894680 0.7839048 0.3854357 0.56555024
# a2 0.6158995 0.6530224 0.8401427 0.04044974
#
# , , c = c3
#
# b
# a b1 b2 b3 b4
# a1 0.3500653 0.7052743 0.42487635 0.5689287
# a2 0.4097346 0.4527939 0.07192528 0.8638655
Now, make a 4 x 4 permutation matrix to shuffle the columns of each arr[, , i], and use apply to matrix-multiply each a x b sub-matrix of arr by it, re-ordering their columns. The important point is that the result of each apply iteration is a matrix:
cols.shuf.mx <- matrix(c(0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0), ncol=4)
apply(arr, 3, `%*%`, cols.shuf.mx)
# c
# c1 c2 c3
# [1,] 0.8851802 0.78390483 0.70527431
# [2,] 0.6183046 0.65302236 0.45279387
# [3,] 0.7321399 0.58946800 0.35006532
# [4,] 0.5896138 0.61589947 0.40973463
# [5,] 0.9307044 0.56555024 0.56892870
# [6,] 0.6652637 0.04044974 0.86386552
# [7,] 0.2469866 0.38543569 0.42487635
# [8,] 0.7732842 0.84014275 0.07192528
Whereas I expected the result to be:
# , , c = c1
#
# a 1 2 3 4
# a1 0.8851802 0.7321399 0.9307044 0.2469866
# a2 0.6183046 0.5896138 0.6652637 0.7732842
#
# , , c = c2
#
# a 1 2 3 4
# a1 0.7839048 0.5894680 0.56555024 0.3854357
# a2 0.6530224 0.6158995 0.04044974 0.8401427
#
# , , c = c3
#
# a 1 2 3 4
# a1 0.7052743 0.3500653 0.5689287 0.42487635
# a2 0.4527939 0.4097346 0.8638655 0.07192528
I can get the expected result with plyr::aaply:
aperm(aaply(arr, 3, `%*%`, cols.shuf.mx), c(2, 3, 1))
but was wondering if there is a simple base way to achieve this result (i.e. am I missing something obvious here to get the desired outcome).
I realize what occurs here is documented behaviour ("If each call to FUN returns a vector of length n, then apply returns an array of dimension c(n, dim(X)[MARGIN]) if n > 1"), but it still seems weird to me that when a function returns an object with dimensions, those dimensions are simply discarded.
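A base-R reassembly is still possible, though (a sketch, not from the original answers): apply flattens each matrix result column-major into an 8 x 3 matrix, so re-wrapping that with array() restores the shape, and on R >= 4.1.0 apply's simplify argument avoids the flattening entirely:
# Re-wrap the flattened result; note the original b dimnames no longer
# apply after the column shuffle, so only dim is restored here.
array(apply(arr, 3, `%*%`, cols.shuf.mx), dim = dim(arr))
# R >= 4.1.0: return a list of matrices and let simplify2array()
# stack them into a 2 x 4 x 3 array.
simplify2array(apply(arr, 3, `%*%`, cols.shuf.mx, simplify = FALSE))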

Here is a less-than-fantastic solution that requires foreknowledge of the dimensions of the result matrix:
vapply(
  seq_len(dim(arr)[3]),            # iterate over the third margin
  function(x, y) arr[, , x] %*% y,
  FUN.VALUE = arr[, , 1],          # template fixing the per-slice result shape
  y = cols.shuf.mx
)

If you read the help page for apply, it basically agrees with your first sentence: it is set up with a particular design, and you would need to construct a new function to do something different. BTW, since cols.shuf.mx is just a permutation matrix, this gives you the same result much more simply than that aperm(aaply(...)) rigamarole (note that the output below comes from a different random arr than the one shown in the question):
arr[ , c(2,1,4,3) , ]
#-------------------------
, , c = c1
b
a b2 b1 b4 b3
a1 0.4089769 0.2875775 0.5281055 0.9404673
a2 0.8830174 0.7883051 0.8924190 0.0455565
, , c = c2
b
a b2 b1 b4 b3
a1 0.9568333 0.5514350 0.1029247 0.6775706
a2 0.4533342 0.4566147 0.8998250 0.5726334
, , c = c3
b
a b2 b1 b4 b3
a1 0.3279207 0.24608773 0.6405068 0.8895393
a2 0.9545036 0.04205953 0.9942698 0.6928034
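To see why the indexing trick is equivalent (my own gloss, not part of the original answer): each column of a permutation matrix picks out exactly one input column, so the index vector can be recovered mechanically from cols.shuf.mx:
# Recover the column permutation encoded by the 0/1 shuffle matrix
perm <- apply(cols.shuf.mx == 1, 2, which)
perm
# [1] 2 1 4 3
# Same values as multiplying every slice by cols.shuf.mx:
all(arr[, perm, ] == array(apply(arr, 3, `%*%`, cols.shuf.mx), dim = dim(arr)))
# [1] TRUE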

Related

replace array values according to values in the first index in the first dimension

I have an array with a few dimensions. I want to replace values according to the values at the first index of the first dimension. In the example below, I want to change all values whose corresponding a1 value == 2. If I change only one index:
set.seed(2)
arr <- array(
  data = sample(1:2, 18, replace = TRUE),
  dim = c(3, 3, 2),
  dimnames = list(paste0("a", 1:3), paste0("b", 1:3), paste0("c", 1:2))
)
# replace second index according to first index of dimension 1
arr[2,,][arr[1,,]==2] <- NA
The result is as expected:
> arr
, , c1
b1 b2 b3
a1 1 2 1
a2 1 NA 1
a3 2 2 1
, , c2
b1 b2 b3
a1 2 2 1
a2 NA NA 2
a3 1 1 2
But if I try to change all the other rows like this:
set.seed(2)
arr <- array(
  data = sample(1:2, 18, replace = TRUE),
  dim = c(3, 3, 2),
  dimnames = list(paste0("a", 1:3), paste0("b", 1:3), paste0("c", 1:2))
)
# replace 2nd & 3rd index according to first index of dimension 1
arr[2:3,,][arr[1,,]==2] <- NA
it doesn't work as I expect. Array indexing is difficult for me to understand. How do I do it correctly (naturally, without changing each index separately)? Thanks.
I expect the result to be:
> arr
, , c1
b1 b2 b3
a1 1 2 1
a2 1 NA 1
a3 2 NA 1
, , c2
b1 b2 b3
a1 2 2 1
a2 NA NA 2
a3 NA NA 2
You can use rep to get the right indices for subsetting. arr[2:3,,] holds two rows for every (column, slice) cell, stored adjacently in column-major order, while the logical mask arr[1,,]==2 has only one value per cell, so each mask value has to be repeated twice to line up:
arr[2:3,,][rep(arr[1,,]==2, each=2)] <- NA
arr
#, , c1
#
# b1 b2 b3
#a1 1 2 1
#a2 1 NA 1
#a3 2 NA 1
#
#, , c2
#
# b1 b2 b3
#a1 2 2 1
#a2 NA NA 2
#a3 NA NA 2
Or more generally.
i <- 2:dim(arr)[1]
arr[i,,][rep(arr[1,,]==2, each=length(i))] <- NA
Or (thanks to @jblood94 for this variant):
arr[-1,,][rep(arr[1,,]==2, each = nrow(arr) - 1)] <- NA
Or using a loop.
for(i in 2:nrow(arr)) arr[i,,][arr[1,,]==2] <- NA
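Another angle (my own sketch, assuming a freshly created arr as above): expand the mask to the full array shape once, instead of subsetting twice:
# Repeat each mask value once per row (column-major layout keeps the
# rows of a cell adjacent), then exclude row 1 so it is never overwritten.
idx <- array(rep(arr[1,,] == 2, each = dim(arr)[1]), dim = dim(arr))
idx[1,,] <- FALSE
arr[idx] <- NA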

Homogenize use of single and double digit numbers in string

I have a very large data.table in which (a large number of) items are identified by strings combining text and numbers.
library(data.table)
dd <- data.table(
  x = c("A4", "A4", "A4", "A14", "A14", "A14", "B4", "B4", "B4"),
  y = c("A4", "A14", "B4", "A4", "A14", "B4", "A4", "A14", "B4"),
  z = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
)
x y z
A4 A4 1
A4 A14 2
A4 B4 3
A14 A4 4
A14 A14 5
A14 B4 6
B4 A4 7
B4 A14 8
B4 B4 9
Numbers can be single or double digit, so R always orders the strings character by character (A14 before A4). gtools::mixedsort can handle this. However, when I reshape the long data to wide with
wide <- dcast(dd, x ~ y, value.var = "z")
R applies the default character ordering again:
x A14 A4 B4
A14 5 4 6
A4 2 1 3
B4 8 7 9
However, I need the original ordering for subsequent matrix calculations. Is there an efficient way to rename string + single digit to string + double digit (A4 -> A04), or another approach I have missed?
Another, and probably the easiest, option is to use mixedorder from the gtools package:
wide <- dcast(dd, x ~ y, value.var = "z")[gtools::mixedorder(x)]
which gives:
> wide
x A14 A4 B4
1: A4 2 1 3
2: A14 5 4 6
3: B4 8 7 9
If you also want to get the column order set the same way, you can additionally use setcolorder (the + 1 offsets the indices past the x column in position 1):
setcolorder(wide, c(1, gtools::mixedorder(names(wide)[-1]) + 1))
which then gives:
> wide
x A4 A14 B4
1: A4 1 2 3
2: A14 4 5 6
3: B4 7 8 9
No additional zeros required in this solution.
library(dplyr)
library(gtools)
# Data frame
df <- data.frame(x = c("A4","A4","A4","A14","A14","A14","B4","B4","B4"),
                 y = c("A4","A14","B4","A4","A14","B4","A4","A14","B4"),
                 z = c(1,2,3,4,5,6,7,8,9),
                 stringsAsFactors = FALSE)
# Reorder columns and rows using `mixedsort`.
wide <- dcast(df, x ~ y, value.var = "z") %>%
  select(x, mixedsort(unique(df$x))) %>%
  slice(match(x, mixedsort(unique(df$x))))
gives:
# x A4 A14 B4
# 1 A4 1 2 3
# 2 A14 4 5 6
# 3 B4 7 8 9
You can use sprintf() to pre-pad numbers with 0s:
sprintf("%s%02d", "A", 1:20)
# [1] "A01" "A02" "A03" "A04" "A05" "A06" "A07" "A08" "A09" "A10" "A11" "A12" "A13" "A14" "A15" "A16" "A17" "A18" "A19" "A20"
You can add the 0s to your data with
dd[nchar(x) == 2, x := paste0(substr(x, 1, 1), 0, substr(x, 2, 2))]
dd[nchar(y) == 2, y := paste0(substr(y, 1, 1), 0, substr(y, 2, 2))]
# x y z
# 1: A04 A04 1
# 2: A04 A14 2
# 3: A04 B04 3
# 4: A14 A04 4
# 5: A14 A14 5
# 6: A14 B04 6
# 7: B04 A04 7
# 8: B04 A14 8
# 9: B04 B04 9
Or, if you need to apply it to more columns:
to.change <- c('x', 'y')
dd[, (to.change) := lapply(.SD, function(x)
     ifelse(nchar(x) > 2, x,
            paste0(substr(x, 1, 1), 0, substr(x, 2, 2)))),
   .SDcols = to.change]
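If the letter prefixes or digit counts vary more than in the example, a regex-based helper is another option (my own sketch; pad2 is a hypothetical name):
# Split each label into letter prefix and numeric part, then re-pad
# the number to two digits with sprintf().
pad2 <- function(s) {
  num <- as.integer(sub("^[A-Za-z]+", "", s))  # numeric part
  pre <- sub("[0-9]+$", "", s)                 # letter prefix
  sprintf("%s%02d", pre, num)
}
pad2(c("A4", "A14", "B4"))
# [1] "A04" "A14" "B04"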
You might want to consider implementing this order directly in the data through factors, so you don't have to fix it with data wrangling later.
If you already have these unique values sorted somewhere, you won't need mixedorder or mixedsort; just convert the columns to factors with that level order then.
Otherwise you can get the order back:
library(gtools)
dd[, 1:2] <- lapply(dd[, 1:2], function(x) factor(x, mixedsort(unique(x))))
And proceed normally:
dcast(dd, x ~ y, value.var = "z")
# x A4 A14 B4
# 1: A4 1 2 3
# 2: A14 4 5 6
# 3: B4 7 8 9

cbind with partially nested list

I'm trying to cbind or unnest or as.data.table a partially nested list.
id <- c(1,2)
A <- c("A1","A2","A3")
B <- c("B1")
AB <- list(A=A,B=B)
ABAB <- list(AB,AB)
nested_list <- list(id=id,ABAB=ABAB)
The length of id is the same as ABAB (2 in this case). I don't know how to unlist a part of this list (ABAB) and cbind another part (id). Here's my desired result as a data.table:
data.table(id=c(1,1,1,2,2,2),A=c("A1","A2","A3","A1","A2","A3"),B=rep("B1",6))
id A B
1: 1 A1 B1
2: 1 A2 B1
3: 1 A3 B1
4: 2 A1 B1
5: 2 A2 B1
6: 2 A3 B1
I haven't tested for more general cases, but this works for the OP example: grouping by id leaves a one-element ABAB list per group, as.data.table turns that sublist into a small table (recycling the shorter B against A), and [[1]] extracts it as the group's result:
library(data.table)
as.data.table(nested_list)[, lapply(ABAB, as.data.table)[[1]], id]
# id A B
#1: 1 A1 B1
#2: 1 A2 B1
#3: 1 A3 B1
#4: 2 A1 B1
#5: 2 A2 B1
#6: 2 A3 B1
Or another option, which is probably faster but more verbose (idcol = 'id' records the list position of each block, which is then remapped to the real id values):
rbindlist(lapply(nested_list$ABAB, as.data.table),
          idcol = 'id')[, id := nested_list$id[id]]
This is some super ugly base R, but produces the desired output.
Reduce(rbind, Map(function(x, y) setNames(data.frame(x, y), c("id", "A", "B")),
                  as.list(nested_list[[1]]),
                  lapply(unlist(nested_list[-1], recursive = FALSE),
                         function(x) Reduce(cbind, x))))
id A B
1 1 A1 B1
2 1 A2 B1
3 1 A3 B1
4 2 A1 B1
5 2 A2 B1
6 2 A3 B1
lapply takes the list of two elements (each containing the A and B variables) extracted with unlist and recursive=FALSE, and returns a list of character matrices with the B variable filled in by recycling. A list of the individual id variables from as.list(nested_list[[1]]) and the list of matrices are fed to Map, which converts corresponding pairs to a data.frame and gives the columns the desired names, returning a list of data.frames. Finally, this list of data.frames is fed to Reduce, which rbinds the results into a single data.frame.
The final Reduce(rbind, ...) could be replaced by data.table's rbindlist if desired.
Here's another hideous solution:
max_length <- max(unlist(lapply(nested_list, function(x) lapply(x, lengths))))
data.frame(id = do.call(c, lapply(nested_list$id, rep, max_length)),
           do.call(rbind, lapply(nested_list$ABAB, function(x)
             do.call(cbind, lapply(x, function(y) {
               if (length(y) < max_length) {
                 rep(y, max_length)
               } else {
                 y
               }
             })))))
# id A B
#1 1 A1 B1
#2 1 A2 B1
#3 1 A3 B1
#4 2 A1 B1
#5 2 A2 B1
#6 2 A3 B1
And one more, also inelegant, but I'd gone too far by the time I saw the other answers:
restructure <- function(nested_l) {
  ids <- as.numeric(max(unlist(lapply(unlist(nested_l, recursive = FALSE), function(x) {
    lapply(x, length)
  }))))
  temp <- data.frame(rep(nested_l$id, each = ids),
                     sapply(seq_along(nested_l$id), function(x) {
                       unlist(lapply(nested_l[[2]], function(y) {
                         return(y[x])
                       }))
                     }))
  names(temp) <- c("id", unique(substring(unlist(nested_l[2]), first = 1, last = 1)))
  return(temp)
}
> restructure(nested_list)
id A B
1 1 A1 B1
2 1 A2 B1
3 1 A3 B1
4 2 A1 B1
5 2 A2 B1
6 2 A3 B1
Joining the party:
library(tidyverse)
temp <- map(nested_list, ~ map(.x, ~ expand.grid(.x)))
df <- map_df(1:2, ~ cbind(temp$id[[.x]], temp$ABAB[[.x]]))
Var1 A B
1 1 A1 B1
2 1 A2 B1
3 1 A3 B1
4 2 A1 B1
5 2 A2 B1
6 2 A3 B1
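A tidyr-based variant for comparison (a sketch, assuming tidyr >= 1.0 for unnest_wider): put the pieces in a tibble and unnest the list-column stepwise, so the length-1 B recycles against the length-3 A.
library(tidyr)
library(tibble)
tibble(id = id, ABAB = ABAB) %>%
  unnest_wider(ABAB) %>%  # list-columns A and B
  unnest(A) %>%           # three rows per id
  unnest(B)               # the single B value repeats across them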

expand.grid with separate variable for each column

I would like to achieve the following data.frame in R:
i1 i2 i3
1 A1 A2 A3
2 No A2 A3
3 A1 No A3
4 No No A3
5 A1 A2 No
6 No A2 No
7 A1 No No
8 No No No
In each column the variable can either be the concatenated string "A" and the column number or "No". The data.frame should contain all possible combinations.
My idea was to use expand.grid, but I don't know how to create the list dynamically. Or is there a better approach?
expand.grid(list(c("A1", "No"), c("A2", "No"), c("A3", "No")))
I guess you could create your own helper function, something like this:
MyList <- function(n) expand.grid(lapply(paste0("A", seq_len(n)), c, "No"))
Then simply pass it the number of elements (e.g., 3)
MyList(3)
# Var1 Var2 Var3
# 1 A1 A2 A3
# 2 No A2 A3
# 3 A1 No A3
# 4 No No A3
# 5 A1 A2 No
# 6 No A2 No
# 7 A1 No No
# 8 No No No
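If you also want the i1, i2, i3 column names shown in the question, a small variation (a sketch) is to name the list before expanding, since expand.grid uses the list's names as column names:
n <- 3
opts <- setNames(lapply(paste0("A", seq_len(n)), c, "No"),
                 paste0("i", seq_len(n)))
expand.grid(opts)
#   i1 i2 i3
# 1 A1 A2 A3
# 2 No A2 A3
# ... (8 rows in total, as above)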
Alternatively, you could also try data.table's CJ equivalent, which should be much more efficient than expand.grid for big n:
library(data.table)
DTCJ <- function(n) do.call(CJ, lapply(paste0("A", seq_len(n)), c, "No"))
DTCJ(3) # will return a sorted cross join
# V1 V2 V3
# 1: A1 A2 A3
# 2: A1 A2 No
# 3: A1 No A3
# 4: A1 No No
# 5: No A2 A3
# 6: No A2 No
# 7: No No A3
# 8: No No No
Another option is using Map with expand.grid:
n <- 3
expand.grid(Map(c, paste0('A', seq_len(n)), 'No'))
Or
expand.grid(as.data.frame(rbind(paste0('A', seq_len(n)), 'No')))
Another option, only using the most fundamental functions in R, is to use the indices:
df <- data.frame(V1 = c('A','A','A','A', rep('No', 4)),
                 V2 = c('A','A','No','No','A','A','No','No'),
                 V3 = c('A','No','A','No','A','No','A','No'),
                 stringsAsFactors = FALSE)
To get the row and col indices of the elements we need to change (the -1/+1 shift keeps the row index from collapsing to 0 for elements in the last row; which(df != 'No', arr.ind = TRUE) would return both indices at once):
rindex <- (which(df != 'No') - 1) %% nrow(df) + 1
cindex <- (which(df != 'No') - 1) %/% nrow(df) + 1
The solution is then basically a one-liner:
df[matrix(c(rindex, cindex), ncol = 2)] <- paste0(df[matrix(c(rindex, cindex), ncol = 2)], cindex)
> df
V1 V2 V3
1 A1 A2 A3
2 A1 A2 No
3 A1 No A3
4 A1 No No
5 No A2 A3
6 No A2 No
7 No No A3
8 No No No

Does calculating correlation between two dataframes require a loop?

I have a set of large dataframes that look like A and B:
A <- data.frame(A1 = c(1,2,3,4,5), B1 = c(6,7,8,9,10), C1 = c(11,12,13,14,15))
A1 B1 C1
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
B <- data.frame(A2 = c(6,7,7,10,11), B2 = c(2,1,3,8,11), C2 = c(1,5,16,7,8))
A2 B2 C2
1 6 2 1
2 7 1 5
3 7 3 16
4 10 8 7
5 11 11 8
I would like to create a vector (C) that denotes the Pearson correlation between A1 & A2, B1 & B2, and C1 & C2. In this case, for example, those correlations are:
[1] 0.95 0.92 0.46
cor accepts two data.frames:
A<-data.frame(A1=c(1,2,3,4,5),B1=c(6,7,8,9,10),C1=c(11,12,13,14,15 ))
B<-data.frame(A2=c(6,7,7,10,11),B2=c(2,1,3,8,11),C2=c(1,5,16,7,8))
cor(A,B)
# A2 B2 C2
# A1 0.9481224 0.9190183 0.459588
# B1 0.9481224 0.9190183 0.459588
# C1 0.9481224 0.9190183 0.459588
diag(cor(A,B))
#[1] 0.9481224 0.9190183 0.4595880
Edit:
Here are some benchmarks:
Unit: microseconds
expr min lq median uq max neval
diag(cor(A, B)) 230.292 238.4225 243.0115 255.0295 352.955 100
mapply(cor, A, B) 267.076 281.5120 286.8030 299.5260 375.087 100
unlist(Map(cor, A, B)) 250.053 259.1045 264.5635 275.9035 1146.140 100
Edit2:
And some better benchmarks using
set.seed(42)
A <- as.data.frame(matrix(rnorm(10*n),ncol=n))
B <- as.data.frame(matrix(rnorm(10*n),ncol=n))
However, I should probably mention that these benchmarks strongly depend on the number of rows and columns; in particular, diag(cor(A, B)) computes the full n x n cross-correlation matrix only to discard the off-diagonal entries, so its advantage fades as the number of columns grows.
Edit3: Since I was asked for the benchmarking code, here it is.
b <- sapply(2^(1:12), function(n) {
  set.seed(42)
  A <- as.data.frame(matrix(rnorm(10 * n), ncol = n))
  B <- as.data.frame(matrix(rnorm(10 * n), ncol = n))
  require(microbenchmark)
  res <- print(microbenchmark(
    diag(cor(A, B)),
    mapply(cor, A, B),
    unlist(Map(cor, A, B)),
    times = 10
  ), unit = "us")
  res$median
})
b <- t(b)
matplot(x = 1:12, log10(b), type = "l",
        ylab = "log10(median [µs])",
        xlab = "log2(n)", col = 1:3, lty = 1)
legend("topleft",
       legend = c("diag(cor(A, B))",
                  "mapply(cor, A, B)",
                  "unlist(Map(cor, A, B))"),
       lty = 1, col = 1:3)
You can use Map, one of the apply family of functions, for that.
Map(function(x,y) cor(x,y),A,B)
$A1
[1] 0.9481224
$B1
[1] 0.9190183
$C1
[1] 0.459588
If you want the output as a vector, as suggested by @Jilber:
unlist(Map(function(x,y) cor(x,y),A,B))
A1 B1 C1
0.9481224 0.9190183 0.4595880
Or you can just use:
unlist(Map(cor,A,B))
A1 B1 C1
0.9481224 0.9190183 0.459588
Another alternative you can use mapply function
> mapply(function(x,y) cor(x,y),A,B)
A1 B1 C1
0.9481224 0.9190183 0.4595880
Or just mapply(cor, A, B), as suggested by @Aaron.
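For completeness, a type-safe variant (my own sketch): vapply behaves like mapply here but pins down the result type and length for each pair, so any shape surprise becomes an error instead of a silently different structure.
vapply(seq_along(A), function(i) cor(A[[i]], B[[i]]), numeric(1))
# [1] 0.9481224 0.9190183 0.4595880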
