R data.table with variable number of columns

For each student in a data set, a certain set of scores may have been collected. We want to calculate the mean for each student, but using only the scores in the columns that were germane to that student.
The columns required in the calculation are different for each row. I've figured out how to write this in R using the usual tools, but I'm trying to rewrite it with data.table, partly for fun, and partly in anticipation that success in this small project will lead to the need to make these calculations for lots and lots of rows.
Here is a small working example of the "choose a specific column set for each row" problem.
set.seed(123234)
## Suppose these are 10 students in various grades
dat <- data.frame(id = 1:10, grade = rep(3:7, times = 2),
                  A = sample(c(1:5, 9), 10, replace = TRUE),
                  B = sample(c(1:5, 9), 10, replace = TRUE),
                  C = sample(c(1:5, 9), 10, replace = TRUE),
                  D = sample(c(1:5, 9), 10, replace = TRUE))
## 9 is a marker for missing value, there might also be
## NAs in real data, and those are supposed to be regarded
## differently in some exercises
## Students in various grades are administered different
## tests. A data structure gives the grade to test linkage.
## The letters are column names in dat
lookup <- list("3" = c("A", "B"),
"4" = c("A", "C"),
"5" = c("B", "C", "D"),
"6" = c("A", "B", "C", "D"),
"7" = c("C", "D"),
"8" = c("C"))
## wrapper around that lookup because I kept getting confused
getLookup <- function(grade){
lookup[[as.character(grade)]]
}
## Function that receives one row (named vector)
## from data frame and chooses columns and makes calculation
getMean <- function(arow, lookup){
  scores <- arow[getLookup(arow["grade"])]
  mean(scores[scores != 9], na.rm = TRUE)
}
stuscores <- apply(dat, 1, function(x) getMean(x, lookup))
result <- data.frame(dat, stuscores)
result
## If the data runs to many thousands of rows,
## I will wish I could use data.table to do that.
## Client will want students sorted by state, district, classroom,
## etc.
## However, I am stumped on how to specify the adjustable
## column-name chooser
library(data.table)
DT <- data.table(dat)
## How to write call to getMean correctly?
## Want to do this for each participant (no grouping)
setkey(DT, id)
The desired output is the student average for the appropriate columns, like so:
> result
   id grade A B C D stuscores
1   1     3 9 9 1 4       NaN
2   2     4 5 4 1 5       3.0
3   3     5 1 3 5 9       4.0
4   4     6 5 2 4 5       4.0
5   5     7 9 1 1 3       2.0
6   6     3 3 3 4 3       3.0
7   7     4 9 2 9 2       NaN
8   8     5 3 9 2 9       2.0
9   9     6 2 3 2 5       3.0
10 10     7 3 2 4 1       2.5
Then what? I've written a lot of mistakes so far...
I did not find any examples in the data.table documentation in which the columns to be used in the calculation for each row were themselves a variable. Thank you for your advice.
I'm not asking anybody to write code for me; I'm asking for advice on how to get started with this problem.

First of all, when creating a reproducible example using functions such as sample (which return different random draws each time you run them), you should use set.seed.
Second, instead of looping over each row, you could loop over the lookup list, which will always be smaller than the data (often much smaller), and combine it with rowMeans. You could also do this with base R (a rough sketch of that is included after the apply() output below), but you asked for a data.table solution, so here goes. For the purposes of this solution I've converted all 9s to NA, but you can try to generalize this to your specific case too.
So using set.seed(123), your function gives
apply(dat, 1, function(x) getMean(x, lookup))
# [1] 2.000000 5.000000 4.666667 4.500000 2.500000 1.000000 4.000000 2.333333 2.500000 1.500000
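For reference, the base R route alluded to above could look roughly like the sketch below; it follows the same "loop over the lookup, not over the rows" idea (this is not code from the original answer, and it handles the 9 marker inline rather than converting the data first):
## base R sketch: loop over the small lookup list rather than over rows
dat$res <- NA_real_
for (g in names(lookup)) {
  rows <- dat$grade == as.integer(g)
  sub  <- dat[rows, lookup[[g]], drop = FALSE]
  sub[sub == 9] <- NA                      # treat the 9 marker as missing
  dat$res[rows] <- rowMeans(sub, na.rm = TRUE)
}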
And here's a possible data.table approach that loops only over the lookup list (for loops over lists are very efficient in R, by the way; see here):
## convert all 9 values to NAs
## (note: this also hits the id column, so id 9 becomes NA in the output below;
##  restrict it to the score columns if that matters)
is.na(dat) <- dat == 9L
## convert your original data to `data.table`;
## there is no need for an additional copy of the data if the data is huge
setDT(dat)
## loop only over the list
for(i in names(lookup)) {
  dat[grade == i, res := rowMeans(as.matrix(.SD[, lookup[[i]], with = FALSE]), na.rm = TRUE)]
}
dat
#     id grade  A  B  C  D      res
#  1:  1     3  2 NA NA NA 2.000000
#  2:  2     4  5  3  5 NA 5.000000
#  3:  3     5  3  5  4  5 4.666667
#  4:  4     6 NA  4 NA  5 4.500000
#  5:  5     7 NA  1  4  1 2.500000
#  6:  6     3  1 NA  5  3 1.000000
#  7:  7     4  4  2  4  5 4.000000
#  8:  8     5 NA  1  4  2 2.333333
#  9: NA     6  4  2  2  2 2.500000
# 10: 10     7  3 NA  1  2 1.500000
Possibly, this could be improved utilizing set, but I can't think of a good way currently.
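For example, a set()-based variant might look like the sketch below (hypothetical, and not necessarily faster than the grouped := loop above; it reuses dat and lookup as defined above):
## compute the row means per grade group, then write them into "res" with set()
for (g in names(lookup)) {
  idx <- which(dat$grade == as.integer(g))
  m   <- rowMeans(as.matrix(dat[idx, lookup[[g]], with = FALSE]), na.rm = TRUE)
  set(dat, i = idx, j = "res", value = m)
}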
P.S.
As suggested by @Arun, please take a look at the vignettes he wrote here in order to get familiar with the := operator, .SD, with = FALSE, etc.

Here's another data.table approach using melt.data.table (needs data.table 1.9.5+) and then joins between data.tables:
DT_m <- setkey(melt.data.table(DT, c("id", "grade"), value.name = "score"), grade, variable)
lookup_dt <- data.table(grade = rep(as.integer(names(lookup)), lengths(lookup)),
                        variable = unlist(lookup), key = "grade,variable")
score_summary <- setkey(DT_m[lookup_dt, nomatch = 0L,
                             .(res = mean(score[score != 9], na.rm = TRUE)), by = id], id)
setkey(DT, id)[score_summary, res := res]
#     id grade A B C D res
#  1:  1     3 9 9 1 4 NaN
#  2:  2     4 5 4 1 5 3.0
#  3:  3     5 1 3 5 9 4.0
#  4:  4     6 5 2 4 5 4.0
#  5:  5     7 9 1 1 3 2.0
#  6:  6     3 3 3 4 3 3.0
#  7:  7     4 9 2 9 2 NaN
#  8:  8     5 3 9 2 9 2.0
#  9:  9     6 2 3 2 5 3.0
# 10: 10     7 3 2 4 1 2.5
It's more verbose, but just over twice as fast:
microbenchmark(da_method(), nk_method(), times = 1000)
#Unit: milliseconds
# expr min lq mean median uq max neval
# da_method() 17.465893 17.845689 19.249615 18.079206 18.337346 181.76369 1000
# nk_method() 7.047405 7.282276 7.757005 7.489351 7.667614 20.30658 1000
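da_method() and nk_method() are not defined in the answer; presumably they wrap the two approaches above into functions over fresh copies of the data, roughly like the following sketch (the names DT9 and DTNA and the exact setup are assumptions, not the actual benchmark code):
## DTNA: a data.table copy of dat with the 9s already converted to NA
## DT9:  a data.table copy that still uses 9 as the missing-value marker
da_method <- function() {
  d <- copy(DTNA)
  for (i in names(lookup)) {
    d[grade == i, res := rowMeans(as.matrix(.SD[, lookup[[i]], with = FALSE]), na.rm = TRUE)]
  }
  d
}
nk_method <- function() {
  d   <- copy(DT9)
  d_m <- setkey(melt.data.table(d, c("id", "grade"), value.name = "score"), grade, variable)
  s   <- setkey(d_m[lookup_dt, nomatch = 0L,
                    .(res = mean(score[score != 9], na.rm = TRUE)), by = id], id)
  setkey(d, id)[s, res := res]
}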

Related

Combining elements of one column into two columns by group in R

Given a two-column data.frame, with one column containing group labels and a second containing integer values ordered from smallest to largest, how can the data be expanded to create pairs of combinations of the integer column within each group?
I'm not sure of the best way to state this. I'm not interested in all possible combinations, but instead in all unique combinations starting from the lowest value.
In R, the combn function gives the desired output without considering groups, for example:
t(combn(1:4, 2))
[,1] [,2]
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 2 3
[5,] 2 4
[6,] 3 4
Since the first value is 1, we get the unique combination (1,2) and not the additional combination (2,1), which I don't need. How would one then apply a similar method by group?
For example, given a data.frame:
test <- data.frame(Group = rep(c("A","B"), each = 4),
                   Val = c(1,3,6,8,2,4,5,7))
test
Group Val
1 A 1
2 A 3
3 A 6
4 A 8
5 B 2
6 B 4
7 B 5
8 B 7
I was able to come up with this solution that gives the desired output:
library(dplyr)  # for filter()
test <- data.frame(Group = rep(c("A","B"), each = 4),
                   Val = c(1,3,6,8,2,4,5,7))
j <- 1
for(i in unique(test$Group)){
  if(j == 1){
    one <- filter(test, i == Group)
    two <- data.frame(t(combn(one$Val, 2)))
    test1 <- data.frame(Group = i, Val1 = two$X1, Val2 = two$X2)
    j <- j + 1
  } else {
    one <- filter(test, i == Group)
    two <- data.frame(t(combn(one$Val, 2)))
    test2 <- data.frame(Group = i, Val1 = two$X1, Val2 = two$X2)
    test1 <- rbind(test1, test2)
  }
}
test1
Group Val1 Val2
1 A 1 3
2 A 1 6
3 A 1 8
4 A 3 6
5 A 3 8
6 A 6 8
7 B 2 4
8 B 2 5
9 B 2 7
10 B 4 5
11 B 4 7
12 B 5 7
However, this is not elegant and is really slow as the number of groups and length of each group become large. It seems like there should be a more elegant and efficient solution but so far I have not come across anything on SO.
I would appreciate any ideas!
Here is a data.table approach:
library( data.table )
#make test a data.table
setDT(test)
#split by group
L <- split( test, by = "Group")
#get unique combinations of 2 Vals
L2 <- lapply( L, function(x) {
  as.data.table( t( combn( x$Val, m = 2, simplify = TRUE ) ) )
})
#merge them back together
data.table::rbindlist( L2, idcol = "Group" )
# Group V1 V2
# 1: A 1 3
# 2: A 1 6
# 3: A 1 8
# 4: A 3 6
# 5: A 3 8
# 6: A 6 8
# 7: B 2 4
# 8: B 2 5
# 9: B 2 7
#10: B 4 5
#11: B 4 7
#12: B 5 7
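The split/lapply/rbindlist steps can also be collapsed into a single grouped call, which should give the same result (a sketch, assuming test has already been converted with setDT()):
## combn() per group, transposed so each combination becomes a row
test[, as.data.table(t(combn(Val, 2L))), by = Group]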
You can set simplify = FALSE in combn() and then use unnest_wider() from tidyr.
library(dplyr)
library(tidyr)
test %>%
  group_by(Group) %>%
  summarise(Val = combn(Val, 2, simplify = FALSE)) %>%
  unnest_wider(Val, names_sep = "_")
# Group Val_1 Val_2
# <chr> <dbl> <dbl>
# 1 A 1 3
# 2 A 1 6
# 3 A 1 8
# 4 A 3 6
# 5 A 3 8
# 6 A 6 8
# 7 B 2 4
# 8 B 2 5
# 9 B 2 7
# 10 B 4 5
# 11 B 4 7
# 12 B 5 7
A tidyverse alternative builds the combinations per group with gtools::combinations():
library(tidyverse)
test2 <- split(test$Val, test$Group) %>%
  map(~gtools::combinations(n = length(.x), r = 2, v = .x)) %>%
  map(~as_tibble(.x, .name_repair = "unique")) %>%
  bind_rows(.id = "Group")

Iterate through columns to sum the previous 2 numbers of each row

In R, I have a dataframe, with columns 'A', 'B', 'C', 'D'. The columns have 100 rows.
I need to iterate through the columns to perform a calculation for all rows in the dataframe which sums each row with the previous row of that column, and then set in new columns ('AA', 'AB', etc.) what that sum is:
A B C D
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
to
A B C D AA AB AC AD
1 2 3 4 NA NA NA NA
2 3 4 5 3 5 7 9
3 4 5 6 5 7 9 11
4 5 6 7 7 9 11 13
5 6 7 8 9 11 13 15
6 7 8 9 11 13 15 17
Can someone explain how to create a function/loop that allows me to set the columns I want to iterate over (selected columns, not all columns) and the columns I want to set?
A base one-liner:
cbind(df, setNames(df + df[c(NA, 1:(nrow(df)-1)), ], paste0("A", names(df))))
If your data is large, this one might be the fastest because it operates on the entire data.frame in one vectorized step.
A dplyr solution using mutate() with across().
library(dplyr)
df %>%
  mutate(across(A:D,
                ~ .x + lag(.x),
                .names = "A{col}"))
# A B C D AA AB AC AD
# 1 1 2 3 4 NA NA NA NA
# 2 2 3 4 5 3 5 7 9
# 3 3 4 5 6 5 7 9 11
# 4 4 5 6 7 7 9 11 13
# 5 5 6 7 8 9 11 13 15
# 6 6 7 8 9 11 13 15 17
If you want to sum the previous 3 rows, the second argument of across(), i.e. .fns, should be
~ .x + lag(.x) + lag(.x, 2)
which is equivalent to the use of rollsum() in zoo:
~ zoo::rollsum(.x, k = 3, fill = NA, align = 'right')
Benchmark
A benchmark test with the microbenchmark package on a data.frame with 10,000 rows and 100 columns, evaluating each expression 10 times.
# Unit: milliseconds
# expr min lq mean median uq max neval
# darren_base 18.58418 20.88498 35.51341 33.64953 39.31909 80.24725 10
# darren_dplyr_lag 39.49278 40.27038 47.26449 42.89170 43.20267 76.72435 10
# arg0naut91_dplyr_rollsum 436.22503 482.03199 524.54800 516.81706 534.94317 677.64242 10
# Grothendieck_rollsumr 3423.92097 3611.01573 3650.16656 3622.50895 3689.26404 4060.98054 10
You can use dplyr's across (and set optional names) with rolling sum (as implemented e.g. in zoo):
library(dplyr)
library(zoo)
df %>%
  mutate(
    across(
      A:D,
      ~ rollsum(., k = 2, fill = NA, align = 'right'),
      .names = 'A{col}'
    )
  )
Output:
A B C D AA AB AC AD
1 1 2 3 4 NA NA NA NA
2 2 3 4 5 3 5 7 9
3 3 4 5 6 5 7 9 11
4 4 5 6 7 7 9 11 13
5 5 6 7 8 9 11 13 15
6 6 7 8 9 11 13 15 17
With A:D we've specified the range of column names we want to apply the function to. The assumption in the .names argument above is that you want to paste together A as a prefix and the column name ({col}).
Here's a data.table solution. As you ask for, it allows you to select which columns you want to apply it to rather than just for all columns.
library(data.table)
x <- data.table(A=1:6, B=2:7, C=3:8, D=4:9)
selected_cols <- c('A','B','D')
new_cols <- paste0("A",selected_cols)
x[, (new_cols) := lapply(.SD, function(col) col+shift(col, 1)), .SDcols = selected_cols]
x[]
NB This is 2 or 3 times faster than the fastest other answer.
This is a naive approach with nested for loops. Beware: it is very slow if you iterate over hundreds of thousands of rows.
i <- 1
n <- 5
df <- data.frame(A=i:(i+n), B=(i+1):(i+n+1), C=(i+2):(i+n+2), D=(i+3):(i+n+3))
for (col in colnames(df)) {
  for (ind in 1:nrow(df)) {
    if (ind - 1 == 0) {next}
    s <- sum(df[c(ind-1, ind), col])
    df[ind, paste0('S', col)] <- s
  }
}
That is a cumsum method:
na.df <- data.frame(matrix(NA, 2, ncol(df)))
colnames(na.df) <- colnames(df)
cs1 <- cumsum(df)
cs2 <- rbind(cs1[-1:-2,], na.df)
sum.diff <- cs2-cs1
cbind(df, rbind(na.df[1,], cs1[2,], sum.diff[1:(nrow(sum.diff)-2),]))
Benchmark:
# Unit: milliseconds
# expr min lq mean median uq max neval
# darrentsai.rbind 11.5623 12.28025 23.38038 16.78240 20.83420 91.9135 100
# darrentsai.rbind.rev1 8.8267 9.10945 15.63652 9.54215 14.25090 62.6949 100
# pseudopsin.dt 7.2696 7.52080 20.26473 12.61465 17.61465 69.0110 100
# ivan866.cumsum 25.3706 30.98860 43.11623 33.78775 37.36950 91.6032 100
I believe the cumsum method wastes most of its time on data.frame allocations. If correctly adapted to a data.table backend, it could be the fastest.
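For what it's worth, such an adaptation might look like the sketch below, reusing x, selected_cols and new_cols from the data.table answer above (not benchmarked):
## rolling sum of 2 via cumsum differences, computed by reference
x[, (new_cols) := lapply(.SD, function(col) {
  cs  <- cumsum(col)
  out <- cs - shift(cs, 2L, fill = 0)
  out[1L] <- NA   # the first row has no preceding value to sum with
  out
}), .SDcols = selected_cols]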
Specify the columns we want. We show several different ways to do that. Then use rollsumr to get the desired columns, set the column names and cbind DF with it.
library(zoo)
# jx <- names(DF) # if all columns wanted
# jx <- sapply(DF, is.numeric) # if all numeric columns
# jx <- c("A", "B", "C", "D") # specify columns by name
jx <- 1:4 # specify columns by position
r <- rollsumr(DF[jx], 2, fill = NA)
colnames(r) <- paste0("A", colnames(r))
cbind(DF, r)
giving:
A B C D AA AB AC AD
1 1 2 3 4 NA NA NA NA
2 2 3 4 5 3 5 7 9
3 3 4 5 6 5 7 9 11
4 4 5 6 7 7 9 11 13
5 5 6 7 8 9 11 13 15
6 6 7 8 9 11 13 15 17
Note
The input in reproducible form:
DF <- structure(list(A = 1:6, B = 2:7, C = 3:8, D = 4:9),
                class = "data.frame", row.names = c(NA, -6L))

Averaging row and column cells from multiple data frames

I have multiple data frames, like:
DG = data.frame(y=c(1,3), v=3:8, x=c(4,6))
DF = data.frame(y=c(1,3), v=3:8, x=c(12,14))
DT = data.frame(y=c(1,3), v=3:8, x=c(4,5))
head(DG)
y v x
1 1 3 4
2 3 4 6
3 1 5 4
4 3 6 6
5 1 7 4
6 3 8 6
head(DT)
y v x
1 1 3 4
2 3 4 5
3 1 5 4
4 3 6 5
5 1 7 4
6 3 8 5
head(DF)
y v x
1 1 3 12
2 3 4 12
3 1 5 12
4 3 6 12
5 1 7 12
6 3 8 12
I want to calculate, for each cell (i.e. each row and column position), the mean across the data frames, so the resulting data frame I need looks like:
  y                              v                              x
1 'mean(DG(y1), DT(y1), DF(y1))' 'mean(DG(v1), DT(v1), DF(v1))' 'mean(DG(x1), DT(x1), DF(x1))'
2 'mean(DG(y2), DT(y2), DF(y2))' 'mean(DG(v2), DT(v2), DF(v2))' 'mean(DG(x2), DT(x2), DF(x2))'
3 'mean(DG(y3), DT(y3), DF(y3))' 'mean(DG(v3), DT(v3), DF(v3))' 'mean(DG(x3), DT(x3), DF(x3))'
....
In reality, y, v and x are different locations and 1 - 6 time steps. I want to average my data for each time step and location. Eventually, I need one data set, that looks like one of the example data sets, but with averaged values in each cell.
I have a working example with loops, but for large datasets it is very slow, so I tried various combinations with apply and rowSums, but neither worked out.
If I understand correctly, there are many data frames which all have the same structure (number, name and type of columns) as well as the same number of rows (time steps). Some data points may contain NA.
The code below creates a large data.table from the single data frames and computes the mean values for each time step and location across the different data frames:
library(data.table)
rbindlist(list(DG, DF, DT), idcol = TRUE)[
, lapply(.SD, mean, na.rm = TRUE), by = .(time_step = rowid(.id))]
time_step y v x
1: 1 1 3 6.666667
2: 2 3 4 8.333333
3: 3 1 5 6.666667
4: 4 3 6 8.333333
5: 5 1 7 6.666667
6: 6 3 8 8.333333
This also works with NAs, e.g.,
DG = data.frame(y=c(1,3), v=3:8, x=c(4,6))
DF = data.frame(y=c(1,3), v=3:8, x=c(12,14))
DT = data.frame(y=c(1,3), v=3:8, x=c(4,5,NA))
Note that column x of DT has been modified
rbindlist(list(DG, DF, DT), idcol = TRUE)[
, lapply(.SD, mean, na.rm = TRUE), by = .(time_step = rowid(.id))]
time_step y v x
1: 1 1 3 6.666667
2: 2 3 4 8.333333
3: 3 1 5 8.000000
4: 4 3 6 8.000000
5: 5 1 7 7.000000
6: 6 3 8 10.000000
Note that x in rows 3 and 6 has changed.
If you only have the three data frames, I would recommend
result = (DG + DT + DF) / 3
result
# y v x
# 1 1 3 6.666667
# 2 3 4 8.333333
# 3 1 5 6.666667
# 4 3 6 8.333333
# 5 1 7 6.666667
# 6 3 8 8.333333
This assumes that your rows and columns are already in the correct order.
If you have more data frames, put them in a list (see here for help with that) and then you can do this:
result = Reduce("+", list_of_data) / length(list_of_data)
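For instance, with the three example data frames the list could be built directly, or collected by name with mget() (just an illustration):
list_of_data <- list(DG, DF, DT)
# or: list_of_data <- mget(c("DG", "DF", "DT"))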
If you need advanced features of mean, like ignoring NAs or trimming, this won't work. Instead, I would recommend converting your data frames to matrices, stacking them into a 3-D array, and applying mean.
library(abind)
stack = abind(DG, DF, DT, along = 3)
# if you have data frames in a list, do this instead:
# stack = do.call(abind, c(list_of_data, along = 3))
apply(stack, MARGIN = 1:2, FUN = mean, na.rm = TRUE)
# y v x
# [1,] 1 3 6.666667
# [2,] 3 4 8.333333
# [3,] 1 5 6.666667
# [4,] 3 6 8.333333
# [5,] 1 7 6.666667
# [6,] 3 8 8.333333
The final method I'll recommend is a "tidy" method - combine your data into one data frame and use grouped operations to produce the result. This can be done easily with data.table or dplyr. See Uwe's answer for a nice data.table implementation.
library(dplyr)
bind_rows(list(DG, DF, DT), .id = ".id") %>%
  group_by(.id) %>%
  mutate(rn = row_number()) %>%
  ungroup() %>%
  select(-.id) %>%
  group_by(rn) %>%
  summarize_all(mean, na.rm = TRUE) %>%
  select(-rn)
# # A tibble: 6 x 3
# y v x
# <dbl> <dbl> <dbl>
# 1 1 3 6.67
# 2 3 4 8.33
# 3 1 5 6.67
# 4 3 6 8.33
# 5 1 7 6.67
# 6 3 8 8.33

Replace Inf/-Inf values from vector of variable names, with values from similarly named vector of variables (substr/grep/gsub)

I'm currently stumped trying to write efficient code. I have a vector of variable names (med.vars) for variables that were transformed by the in-year global median. Sometimes the global median is 0, which creates Inf/-Inf values that I would like to replace with the pre-transformed variable value (vars). I can't figure out how to do this efficiently with some kind of data.table dat[, := lapply(.SD), .SDcols = med.vars] call or a for loop with get(), noquote(), etc.
dat <- data.table(v1 = c(2,10,7), v2 = c(5,6,5), v3 = c(10,15,20),
                  v1.med = c(1,Inf,5), v2.med = c(5,6,5), v3.med = c(-Inf,2,3))
vars <- c("v1","v2","v3")
med.vars <- c("v1.med","v2.med","v3.med")
   v1 v2 v3 v1.med v2.med v3.med
1:  2  5 10      1      5   -Inf
2: 10  6 15    Inf      6      2
3:  7  5 20      5      5      3
In reality these vectors cover 50+ variables that I pull from names(dat) with grep(), and I use gsub(".med", "", med.vars) to create the second vector of pre-transformed variable names.
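That extraction step might look something like this (a sketch; the exact pattern depends on the real column names):
med.vars <- grep("\\.med$", names(dat), value = TRUE)
vars <- gsub(".med", "", med.vars, fixed = TRUE)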
I would like to efficiently perform
dat[v1.med==Inf | v1.med==-Inf, v1.med:=v1]
dat[v3.med==Inf | v3.med==-Inf, v3.med:=v3]
for each element med.vars[i] and its corresponding element vars[i], such that the resulting data.table is:
   v1 v2 v3 v1.med v2.med v3.med
1:  2  5 10      1      5    -10
2: 10  6 15     10      6      2
3:  7  5 20      5      5      3
Thank you for your time
OP mentions efficiency, so maybe move to long form. Then the standard syntax can be used:
DT = melt(dat, meas=list(vars, med.vars), value.name=c("var", "med"))
DT[!is.finite(med), med := sign(med)*var]
variable var med
1: 1 2 1
2: 1 10 10
3: 1 7 5
4: 2 5 5
5: 2 6 6
6: 2 5 5
7: 3 10 -10
8: 3 15 2
9: 3 20 3
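If the wide layout is needed again afterwards, the long table can be cast back, for example like this (a sketch; note the reshaped columns come out named var_1/med_1 etc. rather than the original v1/v1.med):
DT[, row := rowid(variable)]
dcast(DT, row ~ variable, value.var = c("var", "med"))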
As these are corresponding columns, we can make use of Map
dat[, (med.vars) := Map(function(x, y) ifelse(is.finite(y), y, x * sign(y)),
                        .SD[, vars, with = FALSE],
                        .SD[, med.vars, with = FALSE])]
dat
# v1 v2 v3 v1.med v2.med v3.med
#1: 2 5 10 1 5 -10
#2: 10 6 15 10 6 2
#3: 7 5 20 5 5 3
Or another option is set by looping through the columns with a for loop
for(j in seq_along(vars)) {
  i1 <- !is.finite(dat[[med.vars[j]]])
  v1 <- dat[[vars[j]]]
  v2 <- dat[[med.vars[j]]]
  set(dat, i = which(i1), j = med.vars[j], value = sign(v2[i1]) * v1[i1])
}
This can also be done in base R (on a data.frame)
i1 <- !sapply(dat[med.vars], is.finite)
dat[med.vars][i1] <- dat[vars][i1] * sign(dat[med.vars][i1])

Correlations by grouping twice in R, using dplyr or aggregate?

My (toy) data looks like:
Item_Id Location_Id date price
1 A 5372 1 .5
2 A 5372 2 NA
3 A 5372 3 1
4 A 6065 1 1
5 A 6065 2 1
6 A 6065 3 3
7 A 7000 1 NA
8 A 7000 2 NA
9 A 7000 3 NA
10 B 5372 1 3
11 B 5372 2 NA
12 B 5372 3 1
13 B 6065 1 2
14 B 6065 2 1
15 B 6065 3 3
16 B 7000 1 8
17 B 7000 2 NA
18 B 7000 3 9
In reality there are hundreds of unique item_Ids and location_Ids.
Data
Item_Id=c(rep('A',9),rep('B',9))
Location_Id=rep(c(rep(5372,3),rep(6065,3),rep(7000,3)),2)
date = rep(1:3,6)
price = c(0.5,NA,1,1,1,3,NA,NA,NA,3,NA,1,2,1,3,8,NA,9)
df = data.frame(Item_Id,Location_Id,date,price)
I want to ultimately get the median correlation (over locations) of the price series for every item with every other item. I tried writing a loop in the hope that it would be quick (not finished):
for(item in items){
  remainingitems = items[items != item]
  for(item2 in remainingitems){
    cortemp = numeric(0)
    for(locat in locations){
      print(locat)
      a = pricepanel[pricepanel$Item_Id == item &
                       pricepanel$Location_Id == locat, ]$price
      b = pricepanel[pricepanel$Item_Id == item2 &
                       pricepanel$Location_Id == locat, ]$price
      cortemp = c(cortemp, cor(cbind(a, b), use = "pairwise.complete.obs")[2])
    }
  }
}
But I stopped because it was much too slow. The innermost loop alone took several minutes, and there are hundreds of stores and items. Basically I want to get the correlation matrix (every product with every other product) for every location, and then take the element-wise median across those matrices.
I expect there is an efficient way to do this, but I am new to this kind of thing in R. I tried reading up on dplyr since I suspect the solution lies there, but I got stuck.
The interim output would be something like:
$5752
A B
A 1 -1
B -1 1
$6065
A B
A 1 0.8660254
B 0.8660254 1
$7000
A B
A 1 NA
B NA 1
Then the final would take the elementwise median of all those location matrices.
Final:
A B
A 1 -.0669873
B -.0669873 1
You could get the "interim" output using dplyr and tidyr:
library(dplyr)
library(tidyr)
cors <- df %>%
  spread(Item_Id, price) %>%
  group_by(Location_Id) %>%
  do(correlation = cor(.[, -(1:2)], use = "pairwise.complete.obs"))
The way that this works is that the spread function (from tidyr) spreads the As, Bs, Cs etc into their own columns:
df %>% spread(Item_Id, price)
# Location_Id date A B
# 1 5372 1 0.5 3
# 2 5372 2 NA NA
# 3 5372 3 1.0 1
# 4 6065 1 1.0 2
# 5 6065 2 1.0 1
# 6 6065 3 3.0 3
# 7 7000 1 NA 8
# 8 7000 2 NA NA
# 9 7000 3 NA 9
(This should work with any number of "Items"- A, B, C, D...) The group_by(Location_Id) function then tells the code to operate within each location. Finally the do command tells it to find the correlation of the columns within each group (. is a placeholder for "the data within each group"), while ignoring the first two columns, Location_Id and date.
The above code produces a result that looks like:
# Source: local data frame [3 x 2]
# Groups: <by row>
#
# Location_Id correlation
# 1 5372 <dbl[2,2]>
# 2 6065 <dbl[2,2]>
# 3 7000 <dbl[2,2]>
The correlation column is a list of your three within-location matrices. At that point you can use the solution in this question to take the elementwise median:
apply(simplify2array(cors$correlation), c(1,2), median, na.rm = TRUE)
Here's a possible split apply solution using base R
lapply(split(df[, c("Item_Id", "price")], df$Location_Id),
       function(x) {
         cor(matrix(x$price, nrow = nrow(x) / length(unique(x$Item_Id))),
             use = "pairwise.complete.obs")
       })
# $`5372`
# [,1] [,2]
# [1,] 1 -1
# [2,] -1 1
#
# $`6065`
# [,1] [,2]
# [1,] 1.0000000 0.8660254
# [2,] 0.8660254 1.0000000
#
# $`7000`
# [,1] [,2]
# [1,] NA NA
# [2,] NA 1
And here's a similar solution to @David's, using the data.table package:
library(data.table)
DT <- dcast.data.table(as.data.table(df),
                       Location_Id + date ~ Item_Id,
                       value.var = "price")[, -2, with = FALSE]
Res <- DT[, .(Res = list(cor(.SD, use = "pairwise.complete.obs"))), Location_Id]
You can then view the cor matrices using
Res$Res
# [[1]]
# A B
# A 1 -1
# B -1 1
#
# [[2]]
# A B
# A 1.0000000 0.8660254
# B 0.8660254 1.0000000
#
# [[3]]
# A B
# A NA NA
# B NA 1
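To finish with the element-wise median across locations (the same final step as in the dplyr answer above), something like this should work:
apply(simplify2array(Res$Res), c(1, 2), median, na.rm = TRUE)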
