Averaging row and column cells from multiple data frames - r

I have multiple data frames, like:
DG = data.frame(y=c(1,3), v=3:8, x=c(4,6))
DF = data.frame(y=c(1,3), v=3:8, x=c(12,14))
DT = data.frame(y=c(1,3), v=3:8, x=c(4,5))
head(DG)
y v x
1 1 3 4
2 3 4 6
3 1 5 4
4 3 6 6
5 1 7 4
6 3 8 6
head(DT)
y v x
1 1 3 4
2 3 4 5
3 1 5 4
4 3 6 5
5 1 7 4
6 3 8 5
head(DF)
y v x
1 1 3 12
2 3 4 12
3 1 5 12
4 3 6 12
5 1 7 12
6 3 8 12
I want to calculate, for each row and each column, the mean of that cell across all data frames, i.e. the resulting data frame I need looks like:
y v x
1 'mean(DG(y1), DT(y1), DF(y1))' 'mean(DG(v1), DT(v1), DF(v1))' 'mean(DG(x1), DT(x1), DF(x1))'
2 'mean(DG(y2), DT(y2), DF(y2))' 'mean(DG(v2), DT(v2), DF(v2))' 'mean(DG(x2), DT(x2), DF(x2))'
3 'mean(DG(y3), DT(y3), DF(y3))' 'mean(DG(v3), DT(v3), DF(v3))' 'mean(DG(x3), DT(x3), DF(x3))'
....
In reality, y, v and x are different locations and 1 - 6 time steps. I want to average my data for each time step and location. Eventually, I need one data set, that looks like one of the example data sets, but with averaged values in each cell.
I have a working example with loops, but for large datasets it is very slow, so I tried various combinations of apply and rowSums, but none of them worked out.
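A loop of the kind described might look like the sketch below (illustrative only, not the original code); it gives the right result but visits every cell individually, which is why it scales poorly.
frames <- list(DG, DT, DF)
out <- DG  # template with the right dimensions and column names
for (i in seq_len(nrow(out))) {
  for (j in seq_along(out)) {
    out[i, j] <- mean(sapply(frames, function(d) d[i, j]), na.rm = TRUE)
  }
}
out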

If I understand correctly, there are many data frames which all have the same structure (number, name and type of columns) as well as the same number of rows (time steps). Some data points may contain NA.
The code below creates a large data.table from the single data frames and computes the mean values for each time step and location across the different data frames:
library(data.table)
# rowid(.id) numbers the rows within each source data frame, i.e. the time step
rbindlist(list(DG, DF, DT), idcol = TRUE)[
  , lapply(.SD, mean, na.rm = TRUE), by = .(time_step = rowid(.id))]
time_step y v x
1: 1 1 3 6.666667
2: 2 3 4 8.333333
3: 3 1 5 6.666667
4: 4 3 6 8.333333
5: 5 1 7 6.666667
6: 6 3 8 8.333333
This will work also with NAs, e.g.,
DG = data.frame(y=c(1,3), v=3:8, x=c(4,6))
DF = data.frame(y=c(1,3), v=3:8, x=c(12,14))
DT = data.frame(y=c(1,3), v=3:8, x=c(4,5,NA))
Note that column x of DT has been modified
rbindlist(list(DG, DF, DT), idcol = TRUE)[
, lapply(.SD, mean, na.rm = TRUE), by = .(time_step = rowid(.id))]
time_step y v x
1: 1 1 3 6.666667
2: 2 3 4 8.333333
3: 3 1 5 8.000000
4: 4 3 6 8.000000
5: 5 1 7 7.000000
6: 6 3 8 10.000000
Note that the x means in rows 3 to 6 have changed: rows 3 and 6 now pick up an NA from DT, and the shorter x vector of DT also recycles differently in rows 4 and 5.

If you only have the three data frames, I would recommend
result = (DG + DT + DF) / 3
result
# y v x
# 1 1 3 6.666667
# 2 3 4 8.333333
# 3 1 5 6.666667
# 4 3 6 8.333333
# 5 1 7 6.666667
# 6 3 8 8.333333
This assumes that your rows and columns are already in the correct order.
If you have more data frames, put them in a list (see here for help with that) and then you can do this:
result = Reduce("+", list_of_data) / length(list_of_data)
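For example, with the three data frames above, the list can be built directly:
list_of_data = list(DG, DT, DF)
result = Reduce("+", list_of_data) / length(list_of_data)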
If you need advanced features of mean, such as ignoring NAs or trimming, this won't work. Instead, I would recommend converting your data frames to matrices, stacking them into a 3-D array, and applying mean.
library(abind)
stack = abind(DG, DF, DT, along = 3)
# if you have data frames in a list, do this instead:
# stack = do.call(abind, c(list_of_data, along = 3))
apply(stack, MARGIN = 1:2, FUN = mean, na.rm = TRUE)
# y v x
# [1,] 1 3 6.666667
# [2,] 3 4 8.333333
# [3,] 1 5 6.666667
# [4,] 3 6 8.333333
# [5,] 1 7 6.666667
# [6,] 3 8 8.333333
The final method I'll recommend is a "tidy" method - combine your data into one data frame and use grouped operations to produce the result. This can be done easily with data.table or dplyr. See Uwe's answer for a nice data.table implementation.
library(dplyr)
bind_rows(list(DG, DF, DT), .id = ".id") %>%
group_by(.id) %>%
mutate(rn = row_number()) %>%
ungroup() %>%
select(-.id) %>%
group_by(rn) %>%
summarize_all(mean, na.rm = TRUE) %>%
select(-rn)
# # A tibble: 6 x 3
# y v x
# <dbl> <dbl> <dbl>
# 1 1 3 6.67
# 2 3 4 8.33
# 3 1 5 6.67
# 4 3 6 8.33
# 5 1 7 6.67
# 6 3 8 8.33

Related

aggregate on multiple columns - keeping the original column names and structure

Please consider the following example, which makes use of aggregate twice.
library(dplyr)
set.seed(5)
x <- data.frame(
  name = sample(c('NM01', 'NM02', 'NM03', 'NM04', 'NM05'), 400, replace = TRUE),
  strand = sample(c('+', '-'), 400, replace = TRUE),
  value = sample(6, 400, replace = TRUE)
)
x_agg_hist <- aggregate(x$value,
                        by = list(strand = x$strand,
                                  transcript = x$name),
                        function(v) hist(v,
                                         breaks = seq(0.5, 6.5),
                                         plot = FALSE)$counts)
y <- data.frame(
  name = c('NM01', 'NM02', 'NM03', 'NM04', 'NM05'),
  value = runif(5)
)
x_agg_hist$value <- y$value[match(x_agg_hist$transcript, y$name)]
x_agg_hist$division <- ifelse(x_agg_hist$value > 0.5, 1, 2) %>% as.factor()
x_agg_hist
strand transcript x.1 x.2 x.3 x.4 x.5 x.6 value division
1 - NM01 6 9 8 5 5 8 0.5661267 1
2 + NM01 4 2 8 8 8 6 0.5661267 1
3 - NM02 8 4 6 5 3 11 0.1178577 2
4 + NM02 7 6 9 8 7 7 0.1178577 2
5 - NM03 4 5 10 4 6 3 0.2572855 2
6 + NM03 6 10 5 9 5 9 0.2572855 2
7 - NM04 7 4 5 7 4 9 0.9678125 1
8 + NM04 4 3 4 10 8 9 0.9678125 1
9 - NM05 4 6 10 5 5 5 0.8891210 1
10 + NM05 11 13 5 8 12 8 0.8891210 1
So far, everything is fine. Specifically, I notice that I can select the columns of the histograms created by aggregate "collectively" using
x_agg_hist$x
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 6 9 8 5 5 8
[2,] 4 2 8 8 8 6
[3,] 8 4 6 5 3 11
[4,] 7 6 9 8 7 7
[5,] 4 5 10 4 6 3
[6,] 6 10 5 9 5 9
Next, I would like to sum the histograms by 'division' and 'strand' (and normalise by the number of observations in each group).
x_agg_hist_agg_sum <- aggregate(x_agg_hist$x,
                                by = list(division = x_agg_hist$division,
                                          strand = x_agg_hist$strand),
                                function(v) sum(v)/length(v))
Note that using x_agg_hist$x to select all the columns of the histograms seems a lot more convenient than what has been proposed here (Aggregate / summarize multiple variables per group (e.g. sum, mean)).
This still works as expected.
x_agg_hist_agg_sum
division strand V1 V2 V3 V4 V5 V6
1 1 - 5.666667 6.333333 7.666667 5.666667 4.666667 7.333333
2 2 - 6.000000 4.500000 8.000000 4.500000 4.500000 7.000000
3 1 + 6.333333 6.000000 5.666667 8.666667 9.333333 7.666667
4 2 + 6.500000 8.000000 7.000000 8.500000 6.000000 8.000000
However, aggregate has now renamed the columns of the (summed) histograms in a way that no longer allows selecting them collectively. Is it possible to tell aggregate to keep the original column names and structure, or is there another method that can do so? (Of course I know that I can use x_agg_hist_agg_sum[, -c(1, 2)], but with my real data (after a lot of further processing) this would be a lot more difficult.)
Cheers,
mce1
I would suggest using dplyr for such long chained operations. There are a lot of benefits to it.
You can do all of the transformation/manipulation and reshaping in a single pipe, without creating intermediate variables like x_agg_hist and x_agg_hist_agg_sum, so you don't have to remember or manage them.
The first few steps of your code can be translated as:
library(dplyr)
x %>%
  group_by(strand, name) %>%
  summarise(res = hist(value, breaks = seq(0.5, 6.5), plot = FALSE)$counts) %>%
  left_join(y, by = 'name') %>%
  mutate(division = factor(ifelse(value > 0.5, 1, 2))) %>%
  ungroup
Then use pivot_wider to cast the data into wide format; unlike aggregate, it lets you keep control over the resulting column names. A possible sketch follows.
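One way to finish the pipeline (my own continuation; the bin index and the count_ prefix are arbitrary names, and it assumes dplyr >= 1.0 with tidyr available):
library(tidyr)
x %>%
  group_by(strand, name) %>%
  summarise(res = hist(value, breaks = seq(0.5, 6.5), plot = FALSE)$counts,
            .groups = "keep") %>%
  mutate(bin = row_number()) %>%   # index of each histogram bin
  ungroup() %>%
  left_join(y, by = "name") %>%
  mutate(division = factor(ifelse(value > 0.5, 1, 2))) %>%
  pivot_wider(names_from = bin, values_from = res, names_prefix = "count_")
The count_1 ... count_6 columns can then be selected collectively, e.g. with starts_with("count_").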

Combining elements of one column into two columns by group in R

Given a two-column data.frame, with one column containing group labels and the other containing integer values ordered from smallest to largest, how can the data be expanded to create pairs of combinations of the integer column?
Not sure the best way to state this. I'm not interested in all possible combinations but instead all unique combinations starting from the lowest value.
In R, the combn function gives the desired output, not considering groups. For example:
t(combn(seq(1:4),2))
[,1] [,2]
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 2 3
[5,] 2 4
[6,] 3 4
Since the first value is 1, we get the unique combination (1,2) and not the additional combination (2,1), which I don't need. How would one then apply a similar method by group?
For example, given a data.frame:
test <- data.frame(Group = rep(c("A","B"), each = 4),
                   Val = c(1,3,6,8,2,4,5,7))
test
Group Val
1 A 1
2 A 3
3 A 6
4 A 8
5 B 2
6 B 4
7 B 5
8 B 7
I was able to come up with this solution that gives the desired output:
library(dplyr)  # for filter()
test <- data.frame(Group = rep(c("A","B"), each = 4),
                   Val = c(1,3,6,8,2,4,5,7))
j <- 1
for(i in unique(test$Group)){
  if(j == 1){
    one <- filter(test, i == Group)
    two <- data.frame(t(combn(one$Val, 2)))
    test1 <- data.frame(Group = i, Val1 = two$X1, Val2 = two$X2)
    j <- j + 1
  } else {
    one <- filter(test, i == Group)
    two <- data.frame(t(combn(one$Val, 2)))
    test2 <- data.frame(Group = i, Val1 = two$X1, Val2 = two$X2)
    test1 <- rbind(test1, test2)
  }
}
test1
Group Val1 Val2
1 A 1 3
2 A 1 6
3 A 1 8
4 A 3 6
5 A 3 8
6 A 6 8
7 B 2 4
8 B 2 5
9 B 2 7
10 B 4 5
11 B 4 7
12 B 5 7
However, this is not elegant and is really slow as the number of groups and length of each group become large. It seems like there should be a more elegant and efficient solution but so far I have not come across anything on SO.
I would appreciate any ideas!
Here is a data.table approach:
library( data.table )
#make test a data.table
setDT(test)
#split by group
L <- split( test, by = "Group")
#get unique combinations of 2 Vals
L2 <- lapply( L, function(x) {
as.data.table( t( combn( x$Val, m = 2, simplify = TRUE ) ) )
})
#merge them back together
data.table::rbindlist( L2, idcol = "Group" )
# Group V1 V2
# 1: A 1 3
# 2: A 1 6
# 3: A 1 8
# 4: A 3 6
# 5: A 3 8
# 6: A 6 8
# 7: B 2 4
# 8: B 2 5
# 9: B 2 7
#10: B 4 5
#11: B 4 7
#12: B 5 7
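Since test is already a data.table after setDT(), the split/combine steps can also be folded into a single grouped call; a more compact variant of the same idea:
test[, as.data.table(t(combn(Val, m = 2))), by = Group]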
You can set simplify = FALSE in combn() and then use unnest_wider() from tidyr.
library(dplyr)
library(tidyr)
test %>%
  group_by(Group) %>%
  summarise(Val = combn(Val, 2, simplify = FALSE)) %>%
  unnest_wider(Val, names_sep = "_")
# Group Val_1 Val_2
# <chr> <dbl> <dbl>
# 1 A 1 3
# 2 A 1 6
# 3 A 1 8
# 4 A 3 6
# 5 A 3 8
# 6 A 6 8
# 7 B 2 4
# 8 B 2 5
# 9 B 2 7
# 10 B 4 5
# 11 B 4 7
# 12 B 5 7
Or, using purrr and gtools::combinations on the question's test data:
library(tidyverse)
df2 <- split(test$Val, test$Group) %>%
  map(~ gtools::combinations(n = length(.x), r = 2, v = .x)) %>%
  map(~ as_tibble(.x, .name_repair = "unique")) %>%
  bind_rows(.id = "Group")

reshaping data with time represented as spells

I have a dataset in which time is represented as spells (i.e. from time 1 to time 2), like this:
d <- data.frame(id = c("A","A","B","B","C","C"),
                t1 = c(1,3,1,3,1,3),
                t2 = c(2,4,2,4,2,4),
                value = 1:6)
I want to reshape this into a panel dataset, i.e. one row for each unit and time period, like this:
result <- data.frame(id = c("A","A","A","A","B","B","B","B","C","C","C","C"),
                     t = c(1:4,1:4,1:4),
                     value = c(1,1,2,2,3,3,4,4,5,5,6,6))
I am attempting to do this with tidyr and gather but not getting the desired result. I am trying something like this which is clearly wrong:
gather(d, 't1', 't2', key=t)
In the actual dataset the spells are irregular.
You were almost there.
Code
library(dplyr)
library(tidyr)
d %>%
  # Gather the needed variables:
  #   t_type: the new column holding the former variable names (t1, t2)
  #   t:      the new column holding the values of those variables
  #   -id, -value: columns that stay as they are and are NOT gathered
  #                under t_type (key) and t (value)
  gather(t_type, t, -id, -value) %>%
  # Select the right columns in the right order.
  # Watch out: we did not select t_type, so it gets dropped.
  select(id, t, value) %>%
  # Arrange / sort the data by the following columns.
  # For a descending order, put a "-" in front of the column name.
  arrange(id, t)
Result
id t value
1 A 1 1
2 A 2 1
3 A 3 2
4 A 4 2
5 B 1 3
6 B 2 3
7 B 3 4
8 B 4 4
9 C 1 5
10 C 2 5
11 C 3 6
12 C 4 6
So the goal is to melt the t1 and t2 columns and to drop the key column that appears as a result. There are a couple of options. Base R's reshape seems tedious here. We may, however, use melt:
library(reshape2)
melt(d, measure.vars = c("t1", "t2"), value.name = "t")[-3]
# id value t
# 1 A 1 1
# 2 A 2 3
# 3 B 3 1
# 4 B 4 3
# 5 C 5 1
# 6 C 6 3
# 7 A 1 2
# 8 A 2 4
# 9 B 3 2
# 10 B 4 4
# 11 C 5 2
# 12 C 6 4
where -3 drops the key column. We may also use gather, as in
gather(d, "key", "t", t1, t2)[-3]
# id value t
# 1 A 1 1
# 2 A 2 3
# 3 B 3 1
# 4 B 4 3
# 5 C 5 1
# 6 C 6 3
# 7 A 1 2
# 8 A 2 4
# 9 B 3 2
# 10 B 4 4
# 11 C 5 2
# 12 C 6 4
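Both melt and gather only reshape the two recorded time points. If the irregular spells should instead be expanded to every intermediate time step, one possible sketch (my own, assuming integer time steps) is:
library(dplyr)
library(tidyr)
d %>%
  rowwise() %>%
  mutate(t = list(seq(t1, t2))) %>%   # expand each spell to all of its time steps
  ungroup() %>%
  select(id, t, value) %>%
  unnest(t) %>%
  arrange(id, t)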

R data.table with variable number of columns

For each student in a data set, a certain set of scores may have been collected. We want to calculate the mean for each student, but using only the scores in the columns that were germane to that student.
The columns required in a calculation are different for each row. I've figured out how to write this in R using the usual tools, but I am trying to rewrite it with data.table, partly for fun, and partly in anticipation of this small project succeeding and leading to calculations over lots and lots of rows.
Here is a small working example of "choose a specific column set for each row problem."
set.seed(123234)
## Suppose these are 10 students in various grades
dat <- data.frame(id = 1:10, grade = rep(3:7, times = 2),
                  A = sample(c(1:5, 9), 10, replace = TRUE),
                  B = sample(c(1:5, 9), 10, replace = TRUE),
                  C = sample(c(1:5, 9), 10, replace = TRUE),
                  D = sample(c(1:5, 9), 10, replace = TRUE))
## 9 is a marker for missing value, there might also be
## NAs in real data, and those are supposed to be regarded
## differently in some exercises
## Students in various grades are administered different
## tests. A data structure gives the grade to test linkage.
## The letters are column names in dat
lookup <- list("3" = c("A", "B"),
"4" = c("A", "C"),
"5" = c("B", "C", "D"),
"6" = c("A", "B", "C", "D"),
"7" = c("C", "D"),
"8" = c("C"))
## wrapper around that lookup because I kept getting confused
getLookup <- function(grade){
lookup[[as.character(grade)]]
}
## Function that receives one row (named vector)
## from data frame and chooses columns and makes calculation
getMean <- function(arow, lookup){
scores <- arow[getLookup(arow["grade"])]
mean(scores[scores != 9], na.rm = TRUE)
}
stuscores <- apply(dat, 1, function(x) getMean(x, lookup))
result <- data.frame(dat, stuscores)
result
## If the data is 1000s of thousands of rows,
## I will wish I could use data.table to do that.
## Client will want students sorted by state, district, classroom,
## etc.
## However, am stumped on how to specify the adjustable
## column-name chooser
library(data.table)
DT <- data.table(dat)
## How to write call to getMean correctly?
## Want to do this for each participant (no grouping)
setkey(DT, id)
The desired output is the student average for the appropriate columns, like so:
> result
id grade A B C D stuscores
1 1 3 9 9 1 4 NaN
2 2 4 5 4 1 5 3.0
3 3 5 1 3 5 9 4.0
4 4 6 5 2 4 5 4.0
5 5 7 9 1 1 3 2.0
6 6 3 3 3 4 3 3.0
7 7 4 9 2 9 2 NaN
8 8 5 3 9 2 9 2.0
9 9 6 2 3 2 5 3.0
10 10 7 3 2 4 1 2.5
Then what? I've written a lot of mistakes so far...
I did not find any data.table examples in which the set of columns used in the calculation was itself different for each row, so I would be thankful for your advice.
I am not asking anybody to write code for me; I'm asking for advice on how to get started with this problem.
First of all, when creating a reproducible example with functions such as sample (which return different results each time they are run), you should use set.seed.
Second of all, instead of looping over each row, you can loop over the lookup list, which will always be smaller than the data (often much smaller), and combine it with rowMeans. You could also do this in base R, but you asked for a data.table solution, so here goes. (For the purposes of this solution I've converted all 9 values in the score columns to NAs, but you can generalize this to your specific case.)
So using set.seed(123), your function gives
apply(dat, 1, function(x) getMean(x, lookup))
# [1] 2.000000 5.000000 4.666667 4.500000 2.500000 1.000000 4.000000 2.333333 2.500000 1.500000
And here's a possible data.table application which runs only over the lookup list (for loops on lists are very efficient in R btw, see here)
## convert all 9 values in the score columns to NAs
## (restricted to the score columns so that, e.g., id == 9 is untouched)
cols <- c("A", "B", "C", "D")
dat[cols][dat[cols] == 9L] <- NA
## convert your original data to `data.table` in place;
## there is no need for an additional copy of the data if the data is huge
setDT(dat)
## loop only over the list
for (i in names(lookup)) {
  dat[grade == i,
      res := rowMeans(as.matrix(.SD[, lookup[[i]], with = FALSE]), na.rm = TRUE)]
}
dat
# id grade A B C D res
# 1: 1 3 2 NA NA NA 2.000000
# 2: 2 4 5 3 5 NA 5.000000
# 3: 3 5 3 5 4 5 4.666667
# 4: 4 6 NA 4 NA 5 4.500000
# 5: 5 7 NA 1 4 1 2.500000
# 6: 6 3 1 NA 5 3 1.000000
# 7: 7 4 4 2 4 5 4.000000
# 8: 8 5 NA 1 4 2 2.333333
# 9: 9 6 4 2 2 2 2.500000
# 10: 10 7 3 NA 1 2 1.500000
Possibly, this could be improved utilizing set, but I can't think of a good way currently.
P.S.
As suggested by @Arun, please take a look at the vignettes he himself wrote here in order to get familiar with the := operator, .SD, with = FALSE, etc.
Here's another data.table approach using melt.data.table (needs data.table 1.9.5+) and then joins between data.tables:
DT_m <- setkey(melt.data.table(DT, c("id", "grade"), value.name = "score"), grade, variable)
lookup_dt <- data.table(grade = rep(as.integer(names(lookup)), lengths(lookup)),
variable = unlist(lookup), key = "grade,variable")
score_summary <- setkey(DT_m[lookup_dt, nomatch = 0L,
.(res = mean(score[score != 9], na.rm = TRUE)), by = id], id)
setkey(DT, id)[score_summary, res := res]
# id grade A B C D res
# 1: 1 3 9 9 1 4 NaN
# 2: 2 4 5 4 1 5 3.0
# 3: 3 5 1 3 5 9 4.0
# 4: 4 6 5 2 4 5 4.0
# 5: 5 7 9 1 1 3 2.0
# 6: 6 3 3 3 4 3 3.0
# 7: 7 4 9 2 9 2 NaN
# 8: 8 5 3 9 2 9 2.0
# 9: 9 6 2 3 2 5 3.0
#10: 10 7 3 2 4 1 2.5
It's more verbose, but just over twice as fast:
microbenchmark(da_method(), nk_method(), times = 1000)
#Unit: milliseconds
# expr min lq mean median uq max neval
# da_method() 17.465893 17.845689 19.249615 18.079206 18.337346 181.76369 1000
# nk_method() 7.047405 7.282276 7.757005 7.489351 7.667614 20.30658 1000

Correlations by grouping twice in R, using dplyr or aggregate?

My (toy) data looks like:
Item_Id Location_Id date price
1 A 5372 1 .5
2 A 5372 2 NA
3 A 5372 3 1
4 A 6065 1 1
5 A 6065 2 1
6 A 6065 3 3
7 A 7000 1 NA
8 A 7000 2 NA
9 A 7000 3 NA
10 B 5372 1 3
11 B 5372 2 NA
12 B 5372 3 1
13 B 6065 1 2
14 B 6065 2 1
15 B 6065 3 3
16 B 7000 1 8
17 B 7000 2 NA
18 B 7000 3 9
In reality there are hundreds of unique item_Ids and location_Ids.
Data
Item_Id=c(rep('A',9),rep('B',9))
Location_Id=rep(c(rep(5372,3),rep(6065,3),rep(7000,3)),2)
date = rep(1:3,6)
price = c(0.5,NA,1,1,1,3,NA,NA,NA,3,NA,1,2,1,3,8,NA,9)
df = data.frame(Item_Id,Location_Id,date,price)
I want to ultimately get the median correlation (over locations) of the price series for every item with every other item. I tried writing a loop in the hopes that it would be quick (not finished):
for(item in items){
  remainingitems = items[items != item]
  for(item2 in remainingitems){
    cortemp = numeric(0)
    for(locat in locations){
      print(locat)
      a = pricepanel[pricepanel$Item_Id == item &
                       pricepanel$Location_Id == locat, ]$price
      b = pricepanel[pricepanel$Item_Id == item2 &
                       pricepanel$Location_Id == locat, ]$price
      cortemp = c(cortemp, cor(cbind(a, b), use = "pairwise.complete.obs")[2])
    }
  }
}
But I stopped because it was much too slow. The innermost loop alone took several minutes, and there are hundreds of stores and items. Basically, I want to get the correlation matrix (every product with every other product) for every location, and then take the element-wise median across those matrices.
I expect there is an efficient way to do this, but I am new to this kind of thing in R. I tried reading dplyr since I suspect the solution lies in there, but I got stuck.
The interim output would be something like:
$5372
A B
A 1 -1
B -1 1
$6065
A B
A 1 0.8660254
B 0.8660254 1
$7000
A B
A 1 NA
B NA 1
Then the final would take the elementwise median of all those location matrices.
Final:
A B
A 1 -.0669873
B -.0669873 1
You could get the "interim" output using dplyr and tidyr:
library(dplyr)
library(tidyr)
cors <- df %>% spread(Item_Id, price) %>%
group_by(Location_Id) %>%
do(correlation = cor(.[, -(1:2)], use = "pairwise.complete.obs"))
The way that this works is that the spread function (from tidyr) spreads the As, Bs, Cs etc into their own columns:
df %>% spread(Item_Id, price)
# Location_Id date A B
# 1 5372 1 0.5 3
# 2 5372 2 NA NA
# 3 5372 3 1.0 1
# 4 6065 1 1.0 2
# 5 6065 2 1.0 1
# 6 6065 3 3.0 3
# 7 7000 1 NA 8
# 8 7000 2 NA NA
# 9 7000 3 NA 9
(This should work with any number of "Items"- A, B, C, D...) The group_by(Location_Id) function then tells the code to operate within each location. Finally the do command tells it to find the correlation of the columns within each group (. is a placeholder for "the data within each group"), while ignoring the first two columns, Location_Id and date.
The above code produces a result that looks like:
# Source: local data frame [3 x 2]
# Groups: <by row>
#
# Location_Id correlation
# 1 5372 <dbl[2,2]>
# 2 6065 <dbl[2,2]>
# 3 7000 <dbl[2,2]>
The correlation column is a list of your three within-location matrices. At that point you can use the solution in this question to take the elementwise median:
apply(simplify2array(cors$correlation), c(1,2), median, na.rm = TRUE)
Here's a possible split apply solution using base R
lapply(split(df[, c("Item_Id", "price")], df$Location_Id),
       function(x) {
         cor(matrix(x$price, nrow = nrow(x)/length(unique(x$Item_Id))),
             use = "pairwise.complete.obs")
       })
# $`5372`
# [,1] [,2]
# [1,] 1 -1
# [2,] -1 1
#
# $`6065`
# [,1] [,2]
# [1,] 1.0000000 0.8660254
# [2,] 0.8660254 1.0000000
#
# $`7000`
# [,1] [,2]
# [1,] NA NA
# [2,] NA 1
And here's a solution similar to @David's, using the data.table package:
library(data.table)
DT <- dcast.data.table(as.data.table(df),
Location_Id + date ~ Item_Id,
value.var = "price")[, -2, with = FALSE]
Res <- DT[, .(Res = list(cor(.SD, use = "pairwise.complete.obs"))), Location_Id]
You can then view the cor matrices using
Res$Res
# [[1]]
# A B
# A 1 -1
# B -1 1
#
# [[2]]
# A B
# A 1.0000000 0.8660254
# B 0.8660254 1.0000000
#
# [[3]]
# A B
# A NA NA
# B NA 1
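To get the element-wise median across locations, the same simplify2array/apply step from above can be applied to this list; with the toy data it reproduces the Final matrix from the question:
apply(simplify2array(Res$Res), c(1, 2), median, na.rm = TRUE)
#            A          B
# A  1.0000000 -0.0669873
# B -0.0669873  1.0000000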
