Does calculating correlation between two dataframes require a loop? - r

I have a set of large dataframes that look like A and B:
A <- data.frame(A1=c(1,2,3,4,5),B1=c(6,7,8,9,10),C1=c(11,12,13,14,15 ))
A1 B1 C1
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
B <- data.frame(A2=c(6,7,7,10,11),B2=c(2,1,3,8,11),C2=c(1,5,16,7,8))
A2 B2 C2
1 6 2 1
2 7 1 5
3 7 3 16
4 10 8 7
5 11 11 8
I would like to create a vector (C) that denotes the Pearson correlation between A1 & A2, B1 & B2, and C1 & C2. In this case, for example, those correlations are:
[1] 0.95 0.92 0.46

cor accepts two data.frames:
A<-data.frame(A1=c(1,2,3,4,5),B1=c(6,7,8,9,10),C1=c(11,12,13,14,15 ))
B<-data.frame(A2=c(6,7,7,10,11),B2=c(2,1,3,8,11),C2=c(1,5,16,7,8))
cor(A,B)
# A2 B2 C2
# A1 0.9481224 0.9190183 0.459588
# B1 0.9481224 0.9190183 0.459588
# C1 0.9481224 0.9190183 0.459588
diag(cor(A,B))
#[1] 0.9481224 0.9190183 0.4595880
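As a quick sanity check, the sketch below (using the A and B from the question) confirms that the diagonal of the full correlation matrix matches the column-by-column result from mapply. Note that cor(A, B) computes every pairwise correlation, ncol(A) * ncol(B) values in total, and diag then keeps only the matching pairs. The names full, via_diag and via_mapply are illustrative only.

```r
A <- data.frame(A1 = c(1, 2, 3, 4, 5), B1 = c(6, 7, 8, 9, 10), C1 = c(11, 12, 13, 14, 15))
B <- data.frame(A2 = c(6, 7, 7, 10, 11), B2 = c(2, 1, 3, 8, 11), C2 = c(1, 5, 16, 7, 8))

# Full matrix of pairwise correlations between columns of A and columns of B
full <- cor(A, B)
dim(full)  # 3 3

# Keep only the matching pairs (A1/A2, B1/B2, C1/C2)
via_diag <- diag(full)

# Column-by-column alternative: only the matching pairs are ever computed
via_mapply <- mapply(cor, A, B)

# Both routes agree up to floating-point tolerance
all.equal(unname(via_diag), unname(via_mapply))  # TRUE
```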
Edit:
Here are some benchmarks:
Unit: microseconds
expr min lq median uq max neval
diag(cor(A, B)) 230.292 238.4225 243.0115 255.0295 352.955 100
mapply(cor, A, B) 267.076 281.5120 286.8030 299.5260 375.087 100
unlist(Map(cor, A, B)) 250.053 259.1045 264.5635 275.9035 1146.140 100
Edit2:
And some better benchmarks using
set.seed(42)
A <- as.data.frame(matrix(rnorm(10*n),ncol=n))
B <- as.data.frame(matrix(rnorm(10*n),ncol=n))
However, I should probably mention that these benchmarks strongly depend on the number of rows.
Edit3: Since I was asked for the benchmarking code, here it is.
b <- sapply(2^(1:12), function(n) {
set.seed(42)
A <- as.data.frame(matrix(rnorm(10*n),ncol=n))
B <- as.data.frame(matrix(rnorm(10*n),ncol=n))
require(microbenchmark)
res <- print(microbenchmark(
diag(cor(A,B)),
mapply(cor, A, B),
unlist(Map(cor,A,B)),
times=10
),unit="us")
res$median
})
b <- t(b)
matplot(x=1:12,log10(b),type="l",
ylab="log10(median [µs])",
xlab="log2(n)",col=1:3,lty=1)
legend("topleft", legend=c("diag(cor(A, B))",
"mapply(cor, A, B)",
"unlist(Map(cor,A,B))"),lty=1, col=1:3)

You can use Map, a relative of the apply functions, for that.
Map(function(x,y) cor(x,y),A,B)
$A1
[1] 0.9481224
$B1
[1] 0.9190183
$C1
[1] 0.459588
If you want the output as a vector, as suggested by @Jilber:
unlist(Map(function(x,y) cor(x,y),A,B))
A1 B1 C1
0.9481224 0.9190183 0.4595880
Or you can just use:
unlist(Map(cor,A,B))
A1 B1 C1
0.9481224 0.9190183 0.459588

Another alternative is the mapply function:
> mapply(function(x,y) cor(x,y),A,B)
A1 B1 C1
0.9481224 0.9190183 0.4595880
Or just mapply(cor, A, B), as suggested by @Aaron.

Related

Use the levels of a dataframe column to add a new column with an incrementing number unique to each level

I'm trying to create a new column in a dataframe that contains an incrementing number based on the levels of a different column. That is, I want to rename the levels of a column so that each level has a unique, incrementing number.
df <- data.frame(y1 = c(100, 100, 100, 200, 200, 500, 500, 500),
y2 = c(6, 5, 4, 2, 5, 4, 3, 2))
df$y1 <- as.factor(df$y1)
levels(df$y1) ## [1] "100" "200" "500"
Expected output: a new y3 column with new level names based on the levels of y1. The "b" isn't necessary, I can add that on later.
y1 y2 y3
100 6 b1
100 5 b1
100 4 b1
200 2 b2
200 5 b2
500 4 b3
500 3 b3
500 2 b3
I've messed around with lapply and various for loops, but I don't really know what I'm doing here... stuff like this:
for (i in levels(df$y1)){
batchnum <- 1
if (i == df$y1){
df$y3 <- paste0("b", batchnum)
batchnum <- batchnum + 1
}
}
This just labels y3 with "b1" for each row, I guess because if is not vectorized or something?
## Warning messages:
1: In if (i == df$y1) { :
the condition has length > 1 and only the first element will be used
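For reference, here is a minimal sketch of how the loop could be repaired, assuming the goal is one label per level. The counter has to be initialised outside the loop, and the comparison against df$y1 is used as a logical index rather than as an if condition. if expects a single TRUE or FALSE; from R 4.2.0 onward a longer condition is an error rather than a warning. The loop variable name lev is illustrative only.

```r
df <- data.frame(y1 = c(100, 100, 100, 200, 200, 500, 500, 500),
                 y2 = c(6, 5, 4, 2, 5, 4, 3, 2))
df$y1 <- as.factor(df$y1)

df$y3 <- NA_character_   # create the column once, up front
batchnum <- 1            # initialise the counter outside the loop

for (lev in levels(df$y1)) {
  # df$y1 == lev is a logical vector; use it to pick the rows to fill
  df$y3[df$y1 == lev] <- paste0("b", batchnum)
  batchnum <- batchnum + 1
}

df$y3
```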
Using data.table:
library(data.table)
setDT(df)
df[, y3 := .GRP, by = y1]
df[, y3 := paste0("b", y3)] # you can change "b" with whatever you want
y1 y2 y3
1: 100 6 b1
2: 100 5 b1
3: 100 4 b1
4: 200 2 b2
5: 200 5 b2
6: 500 4 b3
7: 500 3 b3
8: 500 2 b3
The most direct and simple approach (taking advantage of the fact that as.numeric will generate numbers corresponding to the factor levels):
df$y3 <- paste0('b', as.numeric(df$y1))
If it's not clear why this works, look at the following code on its own:
as.numeric(df$y1)
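To make that concrete, a small sketch on the same y1 values: a factor is stored as integer codes that point into its levels, so as.numeric (or as.integer) returns the codes, not the original numbers. Recovering the original values needs a detour through as.character.

```r
y1 <- factor(c(100, 100, 100, 200, 200, 500, 500, 500))

levels(y1)                     # "100" "200" "500"

# Integer codes: position of each value within levels(y1)
codes <- as.integer(y1)        # 1 1 1 2 2 3 3 3

# The original numbers come back only via the level labels
values <- as.numeric(as.character(y1))  # 100 100 100 200 200 500 500 500
```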
A dplyr approach:
require(dplyr);
df %>% mutate(y3 = paste0("b", as.numeric(y1)));
# y1 y2 y3
#1 100 6 b1
#2 100 5 b1
#3 100 4 b1
#4 200 2 b2
#5 200 5 b2
#6 500 4 b3
#7 500 3 b3
#8 500 2 b3
Or you can also do:
df %>% mutate(y3 = paste0("b", cumsum(!duplicated(y1))));
# y1 y2 y3
#1 100 6 b1
#2 100 5 b1
#3 100 4 b1
#4 200 2 b2
#5 200 5 b2
#6 500 4 b3
#7 500 3 b3
#8 500 2 b3
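One caveat, sketched here with a hypothetical unsorted vector: as.numeric numbers the groups in level order, while cumsum(!duplicated(...)) only increments when a value appears for the first time, so it gives correct group numbers only when equal values sit next to each other. match(x, unique(x)) is shown as one possible alternative that numbers groups by first appearance.

```r
# Hypothetical data in which the value 500 is not contiguous
y1 <- factor(c(500, 100, 500, 200))

# Level order: levels are "100", "200", "500"
by_level <- as.numeric(y1)                # 3 1 3 2

# Counts first appearances; the second 500 wrongly shares a number with 100
by_cumsum <- cumsum(!duplicated(y1))      # 1 2 2 3

# Order of first appearance, correct for non-contiguous values too
by_appearance <- match(y1, unique(y1))    # 1 2 1 3
```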
Here's one way:
x <- c(100,100,100,200,200,500,500,500)
paste0("b",rep(seq_along(table(x)),table(x)))
[1] "b1" "b1" "b1" "b2" "b2" "b3" "b3" "b3"
One can use the group_indices function from dplyr to create the new column:
library(dplyr)
df %>% mutate(y3 = paste0("b", group_indices(.,y1)))
# y1 y2 y3
#1 100 6 b1
#2 100 5 b1
#3 100 4 b1
#4 200 2 b2
#5 200 5 b2
#6 500 4 b3
#7 500 3 b3
#8 500 2 b3

cbind with partially nested list

I'm trying to cbind or unnest or as.data.table a partially nested list.
id <- c(1,2)
A <- c("A1","A2","A3")
B <- c("B1")
AB <- list(A=A,B=B)
ABAB <- list(AB,AB)
nested_list <- list(id=id,ABAB=ABAB)
The length of id is the same as ABAB (2 in this case). I don't know how to unlist a part of this list (ABAB) and cbind another part (id). Here's my desired result as a data.table:
data.table(id=c(1,1,1,2,2,2),A=c("A1","A2","A3","A1","A2","A3"),B=rep("B1",6))
id A B
1: 1 A1 B1
2: 1 A2 B1
3: 1 A3 B1
4: 2 A1 B1
5: 2 A2 B1
6: 2 A3 B1
I haven't tested for more general cases, but this works for the OP example:
library(data.table)
as.data.table(nested_list)[, lapply(ABAB, as.data.table)[[1]], id]
# id A B
#1: 1 A1 B1
#2: 1 A2 B1
#3: 1 A3 B1
#4: 2 A1 B1
#5: 2 A2 B1
#6: 2 A3 B1
Or another option (which is probably faster, but is more verbose):
rbindlist(lapply(nested_list$ABAB, as.data.table),
idcol = 'id')[, id := nested_list$id[id]]
This is some super ugly base R, but produces the desired output.
Reduce(rbind, Map(function(x, y) setNames(data.frame(x, y), c("id", "A", "B")),
as.list(nested_list[[1]]),
lapply(unlist(nested_list[-1], recursive=FALSE),
function(x) Reduce(cbind, x))))
id A B
1 1 A1 B1
2 1 A2 B1
3 1 A3 B1
4 2 A1 B1
5 2 A2 B1
6 2 A3 B1
lapply takes a list of two elements (each containing the A and B variables) extracted with unlist and recursive=FALSE. It returns a list of character matrices with the B variable filled in by recycling. A list of the individual id variables from as.list(nested_list[[1]]) and the list of matrices are fed to Map, which converts corresponding pairs to a data.frame, gives the columns the desired names, and returns a list of data.frames. Finally, this list of data.frames is fed to Reduce, which rbinds the results into a single data.frame.
The final Reduce(rbind, ...) could be replaced by data.table's rbindlist if desired.
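A shorter base R variant of the same idea, as a sketch only: it assumes every element of ABAB has components named A and B, and it relies on data.frame recycling the length-one id and B values to the length of A. The names parts and result are illustrative.

```r
id <- c(1, 2)
AB <- list(A = c("A1", "A2", "A3"), B = c("B1"))
nested_list <- list(id = id, ABAB = list(AB, AB))

# Pair each id with its A/B list and build one small data.frame per pair;
# data.frame() recycles the scalar id and B to match the length of A
parts <- Map(function(i, ab) {
  data.frame(id = i, A = ab$A, B = ab$B, stringsAsFactors = FALSE)
}, nested_list$id, nested_list$ABAB)

# Stack the per-id pieces into a single data.frame
result <- do.call(rbind, parts)
result
```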
Here's another hideous solution:
max_length = max(unlist(lapply(nested_list, function(x) lapply(x, lengths))))
data.frame(id = do.call(c, lapply(nested_list$id, rep, max_length)),
do.call(rbind, lapply(nested_list$ABAB, function(x)
do.call(cbind, lapply(x, function(y) {
if(length(y) < max_length) {
rep(y, max_length)
} else {
y
}
})))))
# id A B
#1 1 A1 B1
#2 1 A2 B1
#3 1 A3 B1
#4 2 A1 B1
#5 2 A2 B1
#6 2 A3 B1
And one more, also inelegant, but I'd gone too far by the time I saw the other answers.
restructure <- function(nested_l) {
ids <- as.numeric(max(unlist(lapply(unlist(nested_l, recursive = FALSE), function(x){
lapply(x, length)
}))))
temp = data.frame(rep(nested_l$id, each = ids),
sapply(1:length(nested_l$id), function(x){
out <-unlist(lapply(nested_l[[2]], function(y){
return(y[x])
}))
}))
names(temp) <- c("id", unique(substring(unlist(nested_l[2]), first = 1, last = 1)))
return(temp)
}
> restructure(nested_list)
id A B
1 1 A1 B1
2 1 A2 B1
3 1 A3 B1
4 2 A1 B1
5 2 A2 B1
6 2 A3 B1
Joining the party:
library(tidyverse)
temp <- map(nested_list,~map(.x,~expand.grid(.x)))
df <- map_df(1:2,~cbind(temp$id[[.x]],temp$ABAB[[.x]]))
Var1 A B
1 1 A1 B1
2 1 A2 B1
3 1 A3 B1
4 2 A1 B1
5 2 A2 B1
6 2 A3 B1

Group by in data.table in R which only keep non NA values from columns

I am new to R.
I want to group by a data.table and keep only non NA values from columns.
My table looks like:
c1 c2 c3 c4
1 A NA NA
1 NA B NA
1 NA NA C
2 A1 NA NA
2 NA B1 NA
2 NA NA C1
I want this result:
c1 c2 c3 c4
1 A B C
2 A1 B1 C1
Hope anyone can help!
Try
library(data.table)
setDT(df1)[, lapply(.SD, na.omit) , by = c1]
# c1 c2 c3 c4
#1: 1 A B C
#2: 2 A1 B1 C1
Or
setDT(df1)[, lapply(.SD, function(x) x[!is.na(x)]) , by = c1]
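The same idea can be sketched in base R, which may help show what the grouped lapply is doing: within each c1 group, every column is reduced to its non-NA entry. This sketch rebuilds the question's table as df1 with character columns (an assumption about the column types) and uses a hypothetical helper, first_non_na, that keeps only the first non-NA value, so it also returns one row per group if a column happens to hold several non-NA values.

```r
df1 <- data.frame(c1 = c(1, 1, 1, 2, 2, 2),
                  c2 = c("A", NA, NA, "A1", NA, NA),
                  c3 = c(NA, "B", NA, NA, "B1", NA),
                  c4 = c(NA, NA, "C", NA, NA, "C1"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: first non-missing value of a vector
first_non_na <- function(x) x[!is.na(x)][1]

# Split by group, collapse every column to one value, then stack the rows
pieces <- lapply(split(df1, df1$c1), function(g) {
  as.data.frame(lapply(g, first_non_na), stringsAsFactors = FALSE)
})
res <- do.call(rbind, pieces)
res
```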
I benchmarked the two methods in @akrun's answer and found that method 2 is faster.
Update: I also added a function that uses complete.cases, as @akrun suggested.
f1 <- function (d) d[, lapply(.SD, na.omit) , by = c1]
f2 <- function (d) d[, lapply(.SD, function(x) x[!is.na(x)]) , by = c1]
f3 <- function (d) d[, lapply(.SD, function(x) x[complete.cases(x)]), by = c1]
microbenchmark(f1(copy(dt2)), f2(copy(dt2)), f3(copy(dt2)))
#Unit: milliseconds
# expr min lq mean median uq max neval
# f1(copy(dt2)) 124.22661 132.84712 138.00615 135.48418 140.18581 222.20735 100
# f2(copy(dt2)) 14.47915 16.37986 18.15728 17.35153 18.38754 28.72007 100
# f3(copy(dt2)) 22.10803 24.43208 27.63959 26.18713 31.58418 39.31601 100

expand.grid with separate variable for each column

I would like to achieve the following data.frame in R:
i1 i2 i3
1 A1 A2 A3
2 No A2 A3
3 A1 No A3
4 No No A3
5 A1 A2 No
6 No A2 No
7 A1 No No
8 No No No
In each column the variable can either be the concatenated string "A" and the column number or "No". The data.frame should contain all possible combinations.
My idea was to use expand.grid, but I don't know how to create the list dynamically. Or is there a better approach?
expand.grid(list(c("A1", "No"), c("A2", "No"), c("A3", "No")))
I guess you could create your own helper function, something like this:
MyList <- function(n) expand.grid(lapply(paste0("A", seq_len(n)), c, "No"))
Then simply pass it the number of elements (e.g., 3)
MyList(3)
# Var1 Var2 Var3
# 1 A1 A2 A3
# 2 No A2 A3
# 3 A1 No A3
# 4 No No A3
# 5 A1 A2 No
# 6 No A2 No
# 7 A1 No No
# 8 No No No
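If the column names i1, i2, i3 from the question are wanted instead of Var1, Var2, Var3, one possible variation is to name the list before passing it to expand.grid. This is a sketch; make_grid and choices are illustrative names, and stringsAsFactors = FALSE is an added choice to get character columns rather than factors.

```r
make_grid <- function(n) {
  # One two-element choice vector per column: c("A1", "No"), c("A2", "No"), ...
  choices <- lapply(seq_len(n), function(i) c(paste0("A", i), "No"))
  # expand.grid uses the list names as column names
  names(choices) <- paste0("i", seq_len(n))
  expand.grid(choices, stringsAsFactors = FALSE)
}

g <- make_grid(3)
g
```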
Alternatively, you could also try data.table's CJ equivalent, which should be much more efficient than expand.grid for a big n
library(data.table)
DTCJ <- function(n) do.call(CJ, lapply(paste0("A", seq_len(n)), c, "No"))
DTCJ(3) # will return a sorted cross join
# V1 V2 V3
# 1: A1 A2 A3
# 2: A1 A2 No
# 3: A1 No A3
# 4: A1 No No
# 5: No A2 A3
# 6: No A2 No
# 7: No No A3
# 8: No No No
Another option is using Map with expand.grid
n <- 3
expand.grid(Map(c, paste0('A', seq_len(n)), 'No'))
Or
expand.grid(as.data.frame(rbind(paste0('A', seq_len(n)),'No')))
Another option, only using the most fundamental functions in R, is to use the indices:
df <- data.frame(V1 = c('A','A','A', 'A',rep('No',4)), V2 = c('A','A','No','No','A','A','No','No'), V3 = c('A','No','A','No','A','No','A','No'), stringsAsFactors = FALSE)
to get the row and col indices of the elements we need to change:
rindex <- which(df != 'No') %% nrow(df)
cindex <- ceiling(which(df != 'No')/nrow(df))
the solution is basically a one-liner:
df[matrix(c(rindex,cindex),ncol=2)] <- paste0(df[matrix(c(rindex,cindex),ncol=2)],cindex)
> df
V1 V2 V3
1 A1 A2 A3
2 A1 A2 No
3 A1 No A3
4 A1 No No
5 No A2 A3
6 No A2 No
7 No No A3
8 No No No
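A possible variant of the same indexing idea, sketched with a logical matrix instead of hand-computed row and column numbers. It sidesteps one pitfall of the modulo step above: which(...) %% nrow(df) yields 0 rather than nrow(df) for a match in the last row, which happens not to occur in this example because the last row is all "No". The name keep is illustrative.

```r
df <- data.frame(V1 = c('A', 'A', 'A', 'A', rep('No', 4)),
                 V2 = c('A', 'A', 'No', 'No', 'A', 'A', 'No', 'No'),
                 V3 = c('A', 'No', 'A', 'No', 'A', 'No', 'A', 'No'),
                 stringsAsFactors = FALSE)

# Logical matrix marking the cells to change
keep <- df != 'No'

# col(keep) holds each cell's column number; append it to the marked cells
df[keep] <- paste0(df[keep], col(keep)[keep])
df
```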

3D array -> apply -> 3D array

It seems apply will not re-assemble 3D arrays when operating on just one margin. Consider:
arr <- array(
runif(2*4*3),
dim=c(2, 4, 3),
dimnames=list(a=paste0("a", 1:2), b=paste0("b", 1:4), c=paste0("c", 1:3))
)
# , , c = c1
#
# b
# a b1 b2 b3 b4
# a1 0.7321399 0.8851802 0.2469866 0.9307044
# a2 0.5896138 0.6183046 0.7732842 0.6652637
#
# , , c = c2
# b
# a b1 b2 b3 b4
# a1 0.5894680 0.7839048 0.3854357 0.56555024
# a2 0.6158995 0.6530224 0.8401427 0.04044974
#
# , , c = c3
# b
# a b1 b2 b3 b4
# a1 0.3500653 0.7052743 0.42487635 0.5689287
# a2 0.4097346 0.4527939 0.07192528 0.8638655
Now, make a 4 x 4 matrix to shuffle columns around in each of arr[, , i], and use apply to matrix-multiply each a*b sub-matrix in arr to reorder its columns. The important point is that the result of each apply iteration is a matrix:
cols.shuf.mx <- matrix(c(0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0), ncol=4)
apply(arr, 3, `%*%`, cols.shuf.mx)
# c
# c1 c2 c3
# [1,] 0.8851802 0.78390483 0.70527431
# [2,] 0.6183046 0.65302236 0.45279387
# [3,] 0.7321399 0.58946800 0.35006532
# [4,] 0.5896138 0.61589947 0.40973463
# [5,] 0.9307044 0.56555024 0.56892870
# [6,] 0.6652637 0.04044974 0.86386552
# [7,] 0.2469866 0.38543569 0.42487635
# [8,] 0.7732842 0.84014275 0.07192528
Whereas, I expected the result to be:
# , , c = c1
#
# a 1 2 3 4
# a1 0.8851802 0.7321399 0.9307044 0.2469866
# a2 0.6183046 0.5896138 0.6652637 0.7732842
#
# , , c = c2
#
# a 1 2 3 4
# a1 0.7839048 0.5894680 0.56555024 0.3854357
# a2 0.6530224 0.6158995 0.04044974 0.8401427
#
# , , c = c3
#
# a 1 2 3 4
# a1 0.7052743 0.3500653 0.5689287 0.42487635
# a2 0.4527939 0.4097346 0.8638655 0.07192528
I can get the expected result with plyr::aaply with:
aperm(aaply(arr, 3, `%*%`, cols.shuf.mx), c(2, 3, 1))
but was wondering if there is a simple base way to achieve this result (i.e. am I missing something obvious here to get the desired outcome).
I realize what occurs here is what is documented (If each call to FUN returns a vector of length n, then apply returns an array of dimension c(n, dim(X)[MARGIN]) if n > 1), but it still seems weird to me that if a function returns an object with dimensions they are basically ignored.
Here is a less than fantastic solution that requires foreknowledge of the dimensions of the function result matrix:
vapply(
1:dim(arr)[3],
function(x, y) arr[,,x] %*% y,
FUN.VALUE=arr[,,1],
y=cols.shuf.mx
)
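Another base option, sketched under the same assumption that the dimensions of each result are known in advance: let apply flatten each result matrix into a column, then restore the dimensions with array. This works because each matrix is flattened in column-major order, which is also how array fills its result. Dimnames are not carried over and would have to be reassigned. The names flat and res are illustrative, and set.seed(1) is only there to make the sketch reproducible.

```r
set.seed(1)
arr <- array(runif(2 * 4 * 3), dim = c(2, 4, 3))
cols.shuf.mx <- matrix(c(0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0), ncol = 4)

# apply() returns an 8 x 3 matrix: one flattened 2 x 4 result per slice
flat <- apply(arr, 3, `%*%`, cols.shuf.mx)
dim(flat)  # 8 3

# Restore the 3D shape: rows of arr, columns of the shuffle matrix, slices
res <- array(flat, dim = c(dim(arr)[1], ncol(cols.shuf.mx), dim(arr)[3]))

# Same result as permuting the columns directly
all.equal(res, arr[, c(2, 1, 4, 3), ])  # TRUE
```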
If you read the help page for apply, it basically agrees with your first sentence. It is set up with a particular design and you would need to construct a new function to do something differently. BTW: This gives you the same result much more simply than that aperm(aaply(...)) rigamarole:
arr[ , c(2,1,4,3) , ]
#-------------------------
, , c = c1
b
a b2 b1 b4 b3
a1 0.4089769 0.2875775 0.5281055 0.9404673
a2 0.8830174 0.7883051 0.8924190 0.0455565
, , c = c2
b
a b2 b1 b4 b3
a1 0.9568333 0.5514350 0.1029247 0.6775706
a2 0.4533342 0.4566147 0.8998250 0.5726334
, , c = c3
b
a b2 b1 b4 b3
a1 0.3279207 0.24608773 0.6405068 0.8895393
a2 0.9545036 0.04205953 0.9942698 0.6928034
