How to convert a >2 column dataframe to 2 columns - r

I am having problems transforming my data.
I have a dataframe that records which transitions were made and how many times each sequence of transitions occurred. The columns correspond to the situation in periods 10, 11 and 12 (there are more in my real data). I want to summarize this data: how many times people went from A to C, from A to D, but also from C to G, etc. So basically I want to aggregate each column against the column that follows it. My ultimate goal is to turn this data into a Sankey diagram.
To illustrate:
df <- data.frame(s10 = unlist(strsplit("AAAABBBBBC","")),
                 s11 = unlist(strsplit("CCDDEEFFFF","")),
                 s12 = unlist(strsplit("GHIGJKMNNN","")),
                 freq = c(10, 20, 30, 40, 50, 60, 70, 40, 20, 20))
   s10 s11 s12 freq
1    A   C   G   10
2    A   C   H   20
3    A   D   I   30
4    A   D   G   40
5    B   E   J   50
6    B   E   K   60
7    B   F   M   70
8    B   F   N   40
9    B   F   N   20
10   C   F   N   20
And I aim to get this result:
   colA colB freq
1     A    C   30
2     A    D   70
3     B    E  110
4     B    F  130
5     C    F   20
6     C    G   10
7     C    H   20
8     D    G   40
9     D    I   30
10    E    J   50
11    E    K   60
12    F    M   70
13    F    N   80
I got this result by first aggregating the sum of frequencies for s10 and s11, and then for s11 and s12, and then changing the column names and binding them together. It works, but I intend to do this with more columns, so I'm sure there is a more efficient way. See the code I used below:
bl1 <- df %>%
  group_by(s10, s11) %>%
  summarise(freq = sum(freq)) %>%
  as.data.frame()

bl2 <- df %>%
  group_by(s11, s12) %>%
  summarise(freq = sum(freq)) %>%
  as.data.frame()

colnames(bl1) <- c('colA', 'colB', 'freq')
colnames(bl2) <- c('colA', 'colB', 'freq')
rbind(bl1, bl2)
Any help is much appreciated!

You can rbind the selected columns of the data.frames together and then use aggregate. The only trick is to rename the columns so that they match. For this, I use setNames.
aggregate(freq ~ colA + colB,
          data = rbind(setNames(df[c("s10", "s11", "freq")], c("colA", "colB", "freq")),
                       setNames(df[c("s11", "s12", "freq")], c("colA", "colB", "freq"))),
          FUN = sum)
This returns the desired result:
   colA colB freq
1     A    C   30
2     A    D   70
3     B    E  110
4     B    F  130
5     C    F   20
6     C    G   10
7     D    G   40
8     C    H   20
9     D    I   30
10    E    J   50
11    E    K   60
12    F    M   70
13    F    N   80
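Since the question mentions doing this for more columns, the same idea generalizes by building the consecutive column pairs programmatically and stacking the renamed slices before aggregating. A minimal sketch, assuming the state columns are all named s10, s11, s12, ... and the count column is called freq (the names state_cols, pairs and long are just illustrative):
state_cols <- grep("^s\\d+$", names(df), value = TRUE)        # "s10" "s11" "s12" ...
pairs <- Map(c, head(state_cols, -1), tail(state_cols, -1))   # consecutive column pairs
long <- do.call(rbind, lapply(pairs, function(p)
  setNames(df[c(p, "freq")], c("colA", "colB", "freq"))))
aggregate(freq ~ colA + colB, data = long, FUN = sum)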

OK, I gave it a go and had some fun with benchmarking. An alternate approach (that I used) is to use aggregate() itself; see fun1 for the implementation. I have made it fit this particular example, and of course it will need tweaking to work with data frames of other widths.
Edit: I have removed the data frame creation from the functions and added the benchmarking output.
require(dplyr); require(microbenchmark)

df <- data.frame(s10 = unlist(strsplit("AAAABBBBBC","")),
                 s11 = unlist(strsplit("CCDDEEFFFF","")),
                 s12 = unlist(strsplit("GHIGJKMNNN","")),
                 freq = c(10, 20, 30, 40, 50, 60, 70, 40, 20, 20))

fun0 <- function(){
  bl1 <- df %>%
    group_by(s10, s11) %>%
    summarise(freq = sum(freq)) %>%
    as.data.frame()
  bl2 <- df %>%
    group_by(s11, s12) %>%
    summarise(freq = sum(freq)) %>%
    as.data.frame()
  colnames(bl1) <- c('colA', 'colB', 'freq')
  colnames(bl2) <- c('colA', 'colB', 'freq')
  return(rbind(bl1, bl2))
}

fun1 <- function(){
  a <- apply(df[, c(1, 2)], 1, function(x) paste(x[1], x[2], sep = "", collapse = ""))
  b <- apply(df[, c(2, 3)], 1, function(x) paste(x[1], x[2], sep = "", collapse = ""))
  z <- data.frame(x = c(a, b), f = rep(df$freq, 2))
  return(aggregate(f ~ x, data = z, FUN = sum))
}

fun0()
fun1()

# benchmarking
MB_res <- microbenchmark(fun0 = fun0(), fun1 = fun1(), times = 1000)
MB_res
Results were:
Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
 fun0 2.218889 2.587820 2.773454 2.676921 2.785586 6.020277  1000
 fun1 1.472971 1.737751 1.908966 1.842152 1.910118 8.915407  1000
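Since the stated end goal is a Sankey diagram, here is a minimal sketch of feeding the two-column result into networkD3 (an assumption on my part that networkD3 is acceptable; sankeyNetwork() wants 0-based integer node indices, and res, nodes and links are just illustrative names):
library(networkD3)
res   <- fun0()                                   # the colA/colB/freq result from above
nodes <- data.frame(name = unique(c(as.character(res$colA), as.character(res$colB))))
links <- data.frame(source = match(as.character(res$colA), nodes$name) - 1,
                    target = match(as.character(res$colB), nodes$name) - 1,
                    value  = res$freq)
sankeyNetwork(Links = links, Nodes = nodes,
              Source = "source", Target = "target",
              Value = "value", NodeID = "name")
Note that because the same letter can appear in several periods, you may want to make the node labels period-specific (e.g. paste the column name onto the state) before building the links.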

Related

Count common sets of items between different customers

I have data on customers and the different products they have purchased:
Customer Product
1 A
1 B
1 C
2 D
2 E
2 F
3 A
3 B
3 D
4 A
4 B
I would like to check which sets of products occur together across different customers. I want to get the count for product combinations of different lengths. For example, the product combination A and B occurs in three different customers; the product group A, B and C occurs in one customer. And so on for all different sets of 2 or more products in the data. Something like:
Product Group Number
A, B, C 1
D, E, F 1
A, B, D 1
A, B 3
Thus, I'm counting the A, B combination in customers who only have product A and B (e.g. customer 4), and in customers who have A and B, but also any other product (e.g. customer 1, who has A, B and C).
Does anyone have any ideas how to do that with either a tidyverse or base R approach? I feel like it ought to be pretty trivial - maybe pivot_wider first, then count?
I have found this question and answer that does what I need for pairs of products, but I need to count combinations of more than two products as well.
If you are able to use a non-base package, you can use a tool dedicated to the task of finding item sets: arules::apriori. It is much faster on larger data sets.
library(arules)

# coerce data frame to binary incidence matrix
# use apriori to get "frequent itemsets"
r = apriori(data = as.matrix(table(dat) > 0),
            # set: type of association mined, minimal support needed of an item set,
            # minimal number of items per item set
            par = list(target = "frequent itemsets",
                       support = 0,
                       minlen = 2))

# coerce itemsets to data.frame, select relevant rows and columns
d = as(r, "data.frame")
d[d$count > 0, c("items", "count")]
# items count
# 4 {B,C} 1
# 5 {A,C} 1
# 6 {E,F} 1
# 7 {D,E} 1
# 10 {D,F} 1
# 13 {B,D} 1
# 14 {A,D} 1
# 15 {A,B} 3
# 25 {A,B,C} 1
# 26 {D,E,F} 1
# 35 {A,B,D} 1
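For reference, the table(dat) > 0 step above builds a logical Customer x Product incidence matrix, which is what apriori() is handed. On the toy data it looks roughly like this (printed layout approximate):
as.matrix(table(dat) > 0)
#          Product
# Customer     A     B     C     D     E     F
#        1  TRUE  TRUE  TRUE FALSE FALSE FALSE
#        2 FALSE FALSE FALSE  TRUE  TRUE  TRUE
#        3  TRUE  TRUE FALSE  TRUE FALSE FALSE
#        4  TRUE  TRUE FALSE FALSE FALSE FALSE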
Timing on larger data set: 10000 customers with up to 6 products each. apriori is quite a lot faster.
# Unit: milliseconds
# expr min lq mean median uq max neval
# f_henrik(dat) 38.95475 39.8621 41.44454 40.67313 41.05565 57.64655 20
# f_allan(dat) 4578.20595 4622.2363 4664.57187 4654.58713 4679.78119 4924.22537 20
# f_jay(dat) 2799.10516 2939.9727 2995.90038 2971.24127 2999.82019 3444.70819 20
# f_uwe_dt(dat) 2943.26219 3007.1212 3028.37550 3027.46511 3060.38380 3076.25664 20
# f_uwe_dplyr(dat) 6339.03141 6375.7727 6478.77979 6448.56399 6521.54196 6816.09911 20
10000 customers with up to 10 products each. apriori is several hundred times faster.
# Unit: milliseconds
# expr min lq mean median uq max neval
# f_henrik(dat) 58.40093 58.95241 59.71129 59.63988 60.43591 61.21082 20
# f_jay(dat) 52824.67760 53369.78899 53760.43652 53555.69881 54049.91600 55605.47980 20
# f_uwe_dt(dat) 22612.87954 22820.12012 22998.85072 22974.32710 23220.00390 23337.22815 20
# f_uwe_dplyr(dat) 26083.20240 26255.88861 26445.49295 26402.67887 26659.81195 27046.83491 20
On the larger data set, Allan's code gave warnings (In rawToBits(as.raw(x)) : out-of-range values treated as 0 in coercion to raw), which seemed to affect the result, so it is not included in the second benchmark.
Data and benchmark code:
set.seed(3)
n_cust = 10000
n_product = sample(2:6, n_cust, replace = TRUE) # 2:10 in second run
dat = data.frame(
  Customer = rep(1:n_cust, n_product),
  Product = unlist(lapply(n_product, function(n) sample(letters[1:6], n))))   # 1:10 in 2nd run

library(microbenchmark)
res = microbenchmark(f_henrik(dat),
                     f_allan(dat),
                     f_jay(dat),
                     f_uwe_dt(dat),
                     f_uwe_dplyr(dat),
                     times = 20L)
Check for equality:
henrik = f_henrik(dat)
allan = f_allan(dat)
jay = f_jay(dat)
uwe_dt = f_uwe_dt(dat)
uwe_dplyr = f_uwe_dplyr(dat)
# change outputs to common format for comparison
# e.g. string format, column names, order
henrik$items = substr(henrik$items, 2, nchar(henrik$items) - 1)
henrik$items = gsub(",", ", ", henrik$items)
l = list(
  henrik = henrik, allan = allan, jay = jay, uwe_dt = uwe_dt, uwe_dplyr = uwe_dplyr)
l = lapply(l, function(d){
  d = setNames(as.data.frame(d), c("items", "count"))
  d = d[order(d$items), ]
  row.names(d) = NULL
  d
})
all.equal(l[["henrik"]], l[["allan"]])
# TRUE
all.equal(l[["henrik"]], l[["jay"]])
# TRUE
all.equal(l[["henrik"]], l[["uwe_dt"]])
# TRUE
all.equal(l[["henrik"]], l[["uwe_dplyr"]])
# TRUE
Functions:
f_henrik = function(dat){
  r = apriori(data = as.matrix(table(dat) > 0),
              par = list(target = "frequent itemsets",
                         support = 0,
                         minlen = 2))
  d = as(r, "data.frame")
  d[d$count > 0, c("items", "count")]
}

f_allan = function(dat){
  all_multiples <- function(strings)
  {
    n <- length(strings)
    do.call("c", sapply(1:2^n, function(x) {
      mystrings <- strings[as.character(rawToBits(as.raw(x))[seq(n)]) == "01"]
      if (length(mystrings) > 1) paste(mystrings, collapse = ", ") else NULL
    }))
  }
  dat %>%
    group_by(Customer) %>%
    arrange(Product) %>%
    summarize(Product_group = all_multiples(Product)) %>%
    group_by(Product_group) %>%
    count(Product_group)
}

f_jay = function(dat){
  a <- split(dat$Product, dat$Customer)   ## thx to #Henrik
  r <- range(lengths(a))
  pr <- unlist(lapply(r[1]:r[2], function(x)
    combn(unique(dat$Product), x, list)), recursive = F)
  or <- rowSums(outer(pr, a, Vectorize(function(x, y) all(x %in% y))))
  res <- data.frame(p.group = sapply(pr, toString), number = or)
  res[res$number > 0, ]
}

f_uwe_dt = function(dat){
  setorder(setDT(dat), Customer, Product)
  dat[, .(Product.Group = unlist(lapply(tail(seq(.N), -1L),
          function(m) combn(unique(Product), m, toString, FALSE)))),
      by = Customer][
      , .N, by = Product.Group]
}

f_uwe_dplyr = function(dat){
  dat %>%
    arrange(Customer, Product) %>%
    group_by(Customer) %>%
    summarise(Product.Group = n() %>%
                seq() %>%
                tail(-1L) %>%
                lapply(function(m) combn(unique(Product), m, toString, FALSE)) %>%
                unlist()) %>%
    ungroup() %>%
    count(Product.Group)
}
If you define a little helper function that gets all multiple groupings:
all_multiples <- function(strings)
{
  n <- length(strings)
  do.call("c", sapply(1:2^n, function(x) {
    mystrings <- strings[as.character(rawToBits(as.raw(x))[seq(n)]) == "01"]
    if (length(mystrings) > 1) paste(mystrings, collapse = ", ") else NULL
  }))
}
then you can do this nicely in a tidyverse pipe:
dat %>%
  group_by(Customer) %>%
  arrange(Product) %>%
  summarize(Product_group = all_multiples(Product)) %>%
  group_by(Product_group) %>%
  count(Product_group)
#> # A tibble: 11 x 2
#> # Groups: Product_group [11]
#> Product_group n
#> <chr> <int>
#> 1 A, B 3
#> 2 A, B, C 1
#> 3 A, B, D 1
#> 4 A, C 1
#> 5 A, D 1
#> 6 B, C 1
#> 7 B, D 1
#> 8 D, E 1
#> 9 D, E, F 1
#> 10 D, F 1
#> 11 E, F 1
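As a side note on how the helper works (an illustration, not part of the answer): the low bits of as.raw(x) act as an inclusion mask over the sorted products, so looping x over 1:2^n enumerates every subset. This is also, presumably, why the larger benchmark data produced out-of-range warnings: as.raw() can only represent values up to 255, so the mask breaks down once a customer has 8 or more products.
as.character(rawToBits(as.raw(5))[1:3])   # the bits of 5 select elements 1 and 3
# [1] "01" "00" "01"
all_multiples(c("A", "B", "C"))
# [1] "A, B" "A, C" "B, C" "A, B, C"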
For the sake of completeness, here is a solution in data.table syntax which can be translated to dplyr syntax as well.
For both implementations, the core idea is the same:
1. Sort by Product (an important step which has been neglected by the other answers posted so far).
2. For each Customer, create the product groups by using combn() with varying lengths m. Product.Group is a kind of natural key created by concatenating the included products with toString(). Here we can see why sorting by Product is important: products B, A as well as A, B should both end up in the same product group A, B.
3. Finally, count the number of occurrences by Product.Group.
data.table version
library(data.table)
setorder(setDT(df), Customer, Product)

df[, .(Product.Group = unlist(lapply(tail(seq(.N), -1L),
       function(m) combn(unique(Product), m, toString, FALSE)))),
   by = Customer][
   , .N, by = Product.Group]
Product.Group N
1: A, B 3
2: A, C 1
3: B, C 1
4: A, B, C 1
5: D, E 1
6: D, F 1
7: E, F 1
8: D, E, F 1
9: A, D 1
10: B, D 1
11: A, B, D 1
dplyr version
library(dplyr)
df %>%
arrange(Customer, Product) %>%
group_by(Customer) %>%
summarise(Product.Group = n() %>%
seq() %>%
tail(-1L) %>%
lapply(function(m) combn(unique(Product), m, toString, FALSE)) %>%
unlist()) %>%
ungroup() %>%
count(Product.Group)
Product.Group n
<chr> <int>
1 A, B 3
2 A, B, C 1
3 A, B, D 1
4 A, C 1
5 A, D 1
6 B, C 1
7 B, D 1
8 D, E 1
9 D, E, F 1
10 D, F 1
11 E, F 1
Data
library(data.table)
df <- fread("
Customer Product
1 A
1 B
1 C
2 D
2 E
2 F
3 A
3 B
3 D
4 A
4 B")
You could split the data along customers, then get all combinations of product pairs and triples using combn. Then find matches using %in% with outer, create a data frame by collapsing the products with toString, and finally discard elements whose count is zero.
# a <- aggregate(Product ~ Customer, dat, I)$Product   ## old solution
# if (is.matrix(a)) a <- as.data.frame(t(a))           ## old solution
a <- split(dat$Product, dat$Customer)                  ## thx to #Henrik
r <- range(lengths(a))
pr <- unlist(lapply(r[1]:r[2], function(x)
  combn(unique(dat$Product), x, list)), recursive = F)
or <- rowSums(outer(pr, a, Vectorize(function(x, y) all(x %in% y))))
res <- data.frame(p.group = sapply(pr, toString), number = or)
res[res$number > 0, ]
# p.group number
# 1 A, B 3
# 2 A, C 1
# 3 A, D 1
# 6 B, C 1
# 7 B, D 1
# 13 D, E 1
# 14 D, F 1
# 15 E, F 1
# 16 A, B, C 1
# 17 A, B, D 1
# 35 D, E, F 1
Data
dat <- read.table(header=TRUE, text="Customer Product
1 A
1 B
1 C
2 D
2 E
2 F
3 A
3 B
3 D
4 A
4 B")

How to add two columns from two different dataframes in R wherein one column just has subset of unique values of the other

I have two dataframes. I need to add a column from each of them and store the result in the original, bigger dataframe, but the bigger dataframe has many more 'branch' values than the smaller one. I tried using match, but for the non-matching branches the sum is NA.
Sample code:
> df1 <- data.frame(branch = letters[seq(1,5)],
+ rev = seq(10,50,10),
+ stringsAsFactors = 0)
> df1
branch rev
1 a 10
2 b 20
3 c 30
4 d 40
5 e 50
>
> df2 <- data.frame(branch = c('b','d'),
+ Amt = c(10,10),
+ stringsAsFactors = 0)
> df2
branch Amt
1 b 10
2 d 10
>
> df1$rev + df2[match(df1$branch,df2$branch),2,drop = 1]
[1] NA 30 NA 50 NA
>
Expected Output
> df1
branch rev
1 a 10
2 b 30
3 c 30
4 d 50
5 e 50
>
I tried using left join as below:
> left_join(df1, df2, by = 'branch')
branch rev Amt
1 a 10 NA
2 b 20 10
3 c 30 NA
4 d 40 10
5 e 50 NA
> df1 <- left_join(df1, df2, by = 'branch')
> df1[is.na(df1)] <- 0
> df1
branch rev Amt
1 a 10 0
2 b 20 10
3 c 30 0
4 d 40 10
5 e 50 0
> df1$rev <- df1$rev + df1$Amt
> df1
branch rev Amt
1 a 10 0
2 b 30 10
3 c 30 0
4 d 50 10
5 e 50 0
> df1$Amt <- NULL
> df1
branch rev
1 a 10
2 b 30
3 c 30
4 d 50
5 e 50
>
Could someone let me know if there's a simpler solution for this?
An option using data.table:
library(data.table)
setDT(df1)[, rev :=
setDT(df2)[.SD, on=.(branch), rev + nafill(Amt, fill=0)]
]
output:
branch rev
1: a 10
2: b 30
3: c 30
4: d 50
5: e 50
How about this, no libraries required:
df1 <- df1[order(df1$branch),] #sort based on branch
df2 <- df2[order(df2$branch),] #sort also so next step works
df1$branch[df1$branch %in% df2$branch] #just to check we are on correct path
#do the task
df1$rev[df1$branch %in% df2$branch] <- df1$rev[df1$branch %in% df2$branch] + df2$Amt[df2$branch %in% df1$branch]
Warning: if there are repeated "branch" values in df2 (e.g. two "b" rows), you would need to accumulate those before adding them to df1; see the sketch below.
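A minimal sketch of that accumulation step in base R, using the column names from the question (df2_sum and idx are just illustrative names):
df2_sum <- aggregate(Amt ~ branch, data = df2, FUN = sum)   # collapse repeated branches
idx <- match(df1$branch, df2_sum$branch)
df1$rev <- df1$rev + ifelse(is.na(idx), 0, df2_sum$Amt[idx])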
One way would be to store the output of match in a variable, replace NA with 0 and then add the values:
vals <- df2$Amt[match(df1$branch,df2$branch)]
df1$rev + replace(vals, is.na(vals), 0)
#[1] 10 30 30 50 50
Something similar in dplyr, using left_join instead of match:
library(dplyr)
df1 %>%
  left_join(df2, by = 'branch') %>%
  mutate(Amt = replace(Amt, is.na(Amt), 0),
         rev = rev + Amt) %>%
  select(names(df1))
Using dplyr, you can stack both dataframes with bind_rows (renaming Amt to rev so the column names match), group by branch and calculate the sum:
library(dplyr)
df1 %>% bind_rows(., rename(df2, rev = Amt)) %>%
  group_by(branch) %>%
  summarise(rev = sum(rev))
# A tibble: 5 x 2
branch rev
<chr> <dbl>
1 a 10
2 b 30
3 c 30
4 d 50
5 e 50
Use aggregate to get the sum of rev within each branch group.
library(magrittr)
colnames(df2)[2] <- "rev"
# the dot passes the rbind() result as the data argument of aggregate()
df1 <- rbind(df1, df2) %>% aggregate(rev ~ branch, ., FUN = sum)

For-loop to summarize and joining by dplyr

Here is my simplified df:
GP_A <- c(rep("a",3),rep("b",2),rep("c",2))
GP_B <- c(rep("d",2),rep("e",4),rep("f",1))
GENDER <- c(rep("M",4),rep("F",3))
LOC <- c(rep("HK",2),rep("UK",3),rep("JP",2))
SCORE <- c(50,70,80,20,30,80,90)
df <- as.data.frame(cbind(GP_A,GP_B,GENDER,LOC,SCORE))
> df
GP_A GP_B GENDER LOC SCORE
1 a d M HK 50
2 a d M HK 70
3 a e M UK 80
4 b e M UK 20
5 b e F UK 30
6 c e F JP 80
7 c f F JP 90
I want to summarize the score by GP_A, GP_B, or other grouping columns which are not shown in this example. As the number of grouping columns might be up to 50, I decided to use a for-loop to summarize the score.
The original method summarizes the score one grouping column at a time:
GP_A_SCORE <- df %>% group_by(GP_A,GENDER,LOC) %>% summarize(SCORE=mean(SCORE))
GP_B_SCORE <- df %>% group_by(GP_B,GENDER,LOC) %>% summarize(SCORE=mean(SCORE))
...
What I want is to use a for-loop like this (it does not run):
GP_list <- c("GP_A","GP_B",...)
LOC_list <- c("HK","UK","JP",...)
SCORE <- list()
for (i in GP_list){
  for (j in LOC_list){
    SCORE[[paste0(i,j)]] <- df %>% group_by(i, j, GENDER) %>% summarize(SCORE = mean(SCORE))
  }
}
As in "group_by()", the variables are classified as character and here is the error shown:
Error: Column I, J is unknown
Is there any method to force R to recognize the variable?
I am facing the same problem on the left_join of dplyr.
Error is shown when I was doing something like: left_join(x,y,by=c(i=i)) inside a loop.
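One way to make group_by() accept column names stored as strings is across(all_of()). A minimal sketch, assuming dplyr >= 1.0 and that df is built with data.frame() so that SCORE stays numeric (as in the data block further down); SCORE_list is just an illustrative name for the result container:
library(dplyr)
GP_list <- c("GP_A", "GP_B")
SCORE_list <- list()
for (i in GP_list){
  SCORE_list[[i]] <- df %>%
    group_by(across(all_of(c(i, "GENDER", "LOC")))) %>%
    summarise(SCORE = mean(SCORE), .groups = "drop")
}
# For the join, c(i = i) creates an element literally named "i"; pass the string
# itself or use setNames() instead, e.g. left_join(x, y, by = setNames(i, i)).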
You could get the data in long format and then calculate the mean
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = starts_with('GP')) %>%
  group_by(GENDER, LOC, name, value) %>%
  summarise(SCORE = mean(SCORE))
# GENDER LOC name value SCORE
# <fct> <fct> <chr> <fct> <dbl>
# 1 F JP GP_A c 85
# 2 F JP GP_B e 80
# 3 F JP GP_B f 90
# 4 F UK GP_A b 30
# 5 F UK GP_B e 30
# 6 M HK GP_A a 60
# 7 M HK GP_B d 60
# 8 M UK GP_A a 80
# 9 M UK GP_A b 20
#10 M UK GP_B e 50
We can use melt from data.table
library(data.table)
melt(setDT(df), measure = patterns("^GP"))[, .(SCORE = mean(SCORE)),
.(GENDER, LOC, variable, value)]
data
df <- data.frame(GP_A,GP_B,GENDER,LOC,SCORE)

Speeding subsetting of data.frame by row based conditions avoiding loops (dplyr, R)

I am new to dplyr and I am struggling with what I believe is a simple function. I have a dataset similar to:
require(dplyr)
dat <- data.frame(t = rep(seq(1, 5, 1), 4),
                  id = rep(c(rep("A",5), rep("B",5), rep("C",5), rep("D",5)), 1),
                  x = 1:20, y = 51:70, h = c(rep(1,10), rep(0,10)))
dat <- arrange(dat, t)
dat <- data.frame(dat,
                  group = c("B", "A", "A", "A", "A", "B", "C", "D", "A", "B",
                            "D", "C", "A", "D", "C", "A", "A", "C", "C", "B"))
dat
I would like to attach a new column to the dataset dat containing the following operation:
for each row (for example row 3 with id == C), take the remaining rows whose value in group differs from the id of the starting row (C in this case)
group those observations by time t
if the id (here the id C in row 3) has value 1 in column h, sum all the x values (from the group based on t) and divide by the standard deviation of the x and y values (from the group based on t). If the id has value 0 in column h, place a 0. If there are no such observations, the code should also place a zero.
For example, for id A in row 1 the code should produce a 0 because all remaining observations at time t == 1 have group == A. For id B in row 2 the code should produce (11 + 16) / sd(c(11, 16, 61, 66)).
How can I do this in dplyr, or any other way that avoids looping? Thank you.
The data looks like
dat
# t id x y h group
# 1 1 A 1 51 1 B
# 2 1 B 6 56 1 A
# 3 1 C 11 61 0 A
# 4 1 D 16 66 0 A
# 5 2 A 2 52 1 A
# 6 2 B 7 57 1 B
# 7 2 C 12 62 0 C
# 8 2 D 17 67 0 D
# 9 3 A 3 53 1 A
# 10 3 B 8 58 1 B
# 11 3 C 13 63 0 D
# 12 3 D 18 68 0 C
# 13 4 A 4 54 1 A
# 14 4 B 9 59 1 D
# 15 4 C 14 64 0 C
# 16 4 D 19 69 0 A
# 17 5 A 5 55 1 A
# 18 5 B 10 60 1 C
# 19 5 C 15 65 0 C
# 20 5 D 20 70 0 B
I tried the following but it does not produce the correct result.
dat %>%
  group_by(t) %>%
  mutate(new = ifelse(id != group, h * (sum(x) / map_dbl(row_number(), ~
    sd(c(x[-.x], y[-.x])))), 0))
This should just illustrate the speed difference between data.table and dplyr. I took the whole ifelse from the mutate, packed it into a data.table operation and grouped by t. So the results are not the ones asked for in the question, but they are at least the same for dplyr and data.table.
library(data.table)
library(dplyr)
library(purrr)   # map_dbl() below comes from purrr
datDT <- data.table(dat)
DTF <- function(){
  d <- datDT[, new := ifelse(id != group, h * (sum(x) /
               map_dbl(row_number(x), ~
                 sd(c(x[-.x], y[-.x])))), 0), by = t]
  d
}

DPF <- function(){
  d <- dat %>%
    group_by(t) %>%
    mutate(new = ifelse(id != group, h * (sum(x) / map_dbl(row_number(x), ~
      sd(c(x[-.x], y[-.x])))), 0))
  d
}
dtres = DTF()
dplres = DPF()
all.equal(dtres, data.table(dplres))
library(microbenchmark)
mc <- microbenchmark(times = 100,
                     DT = DTF(),
                     DPLYR = DPF())
mc
Unit: milliseconds
expr min lq mean median uq max neval cld
DT 7.428605 7.821919 8.324179 8.056762 8.429851 15.39028 100 a
DPLYR 11.154076 11.439025 11.895716 11.720050 12.139022 16.40934 100 b
The gain is not huge, but it is still noticeable, and I'm sure there is more optimization to be had by setting keys, getting rid of the ifelse, etc., but I leave that to the real data.table experts :).
So if you're new to both, maybe dig into data.table, since you can also use dplyr verbs with data.tables (like below) and be slightly faster than with tbl structures.
dtres %>%
  group_by(t) %>%
  summarise(mN = mean(new))

Aggregate - Use more than one variable in user defined function

I'm working with a large dataset and doing some calculations with the aggregate() function.
This time I need to group by two different columns, and for my calculation I need a user-defined function that also uses two columns of the data.frame. That's where I'm stuck.
Here's an example data set:
dat <- data.frame(Kat = c("a","b","c","a","c","b","a","c"),
                  Sex = c("M","F","F","F","M","M","F","M"),
                  Val1 = c(1,2,3,4,5,6,7,8)*10,
                  Val2 = c(2,6,3,3,1,4,7,4))
> dat
Kat Sex Val1 Val2
a M 10 2
b F 20 6
c F 30 3
a F 40 3
c M 50 1
b M 60 4
a F 70 7
c M 80 4
Example of user defined function:
sum(Val1 * Val2) # but grouped by Kat and Sex
I tried this:
aggregate((dat$Val1),
          by = list(dat$Kat, dat$Sex),
          function(x, y = dat$Val2){sum(x*y)})
Output:
Group.1 Group.2 x
a F 1710
b F 600
c F 900
a M 300
b M 1800
c M 2010
But my expected output would be:
Group.1 Group.2 x
a F 610
b F 120
c F 90
a M 20
b M 240
c M 370
Is there any way to do this with aggregate()?
The attempt above gives the wrong result because aggregate() only passes the grouped Val1 values as x, while y = dat$Val2 is always the full column, so x is recycled over all eight Val2 values. Moving the product into the formula keeps both columns grouped together. As #jogo suggested:
aggregate(Val1 * Val2 ~ Kat + Sex, FUN = sum, data = dat)
Or in a tidyverse style
library(dplyr)
dat %>%
  group_by(Kat, Sex) %>%
  summarize(sum(Val1 * Val2))
Or with data.table
library(data.table)
setDT(dat)
dat[ , sum(Val1 * Val2), by = list(Kat, Sex)]
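As a quick sanity check against the expected output, take the (a, F) group: the rows with Kat == "a" and Sex == "F" have Val1 = 40, 70 and Val2 = 3, 7, so sum(Val1 * Val2) = 40*3 + 70*7 = 610, matching the first row of the expected result.
with(dat[dat$Kat == "a" & dat$Sex == "F", ], sum(Val1 * Val2))
# [1] 610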
