Summary statistics from aggregated groups using data.table - r

I have a dataset with this structure:
library(data.table)
dt <- data.table(
record=c(1:20),
area=rep(LETTERS[1:4], c(4, 6, 3, 7)),
score=c(1,1:3,2:3,1,1,1,2,2,1,2,1,1,1,1,1:3),
cluster=c("X", "Y", "Z")[c(1,1:3,3,2,1,1:3,1,1:3,3,3,3,1:3)]
)
I would like to aggregate the data so I can identify the most common cluster in each area for a given score (for example 1). I would also like some basic frequencies and percentages to be calculated with an output looking something like this:
dt_summary_for_1_score <- data.table(
area=c("A","B","C","D"),
cluster_mode=c("X","X","X","Z"),
cluster_pct = c(100,66.6,100,80),
cluster_freq = c(2,2,1,4),
record_freq = c(2,3,1,5)
)
Ideally I would like a solution that uses data.table. Thanks.

I would leverage frank, though a solution with sort(table(cluster)) is possible as well.
dt_summary =
dt[ , .N, keyby = .(area, score, cluster)
][ , {
idx = frank(-N, ties.method = 'min') == 1
NN = sum(N)
.(
cluster_mode = cluster[idx],
cluster_pct = 100*N[idx]/NN,
cluster_freq = N[idx],
record_freq = NN
)
}, by = .(area, score)]
To get the example with score == 1 we can subset this:
dt_summary[score == 1]
# area score cluster_mode cluster_pct cluster_freq record_freq
# 1: A 1 X 100.00000 2 2
# 2: B 1 X 66.66667 2 3
# 3: C 1 X 100.00000 1 1
# 4: D 1 Z 80.00000 4 5
Note that this returns multiple rows per area/score in the case of ties. As alternatives, you might try something like cluster_mode = paste(cluster[idx], collapse = '|') or cluster_mode = list(cluster[idx]).
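For example, a minimal sketch of the collapse-to-one-row variant (using the same dt as above; the name dt_summary_ties is just illustrative):
dt_summary_ties = dt[ , .N, keyby = .(area, score, cluster)
][ , {
idx = frank(-N, ties.method = 'min') == 1
NN = sum(N)
.(
# tied winners all share the same count N, so collapse the names
# and take the count of the first winner
cluster_mode = paste(cluster[idx], collapse = '|'),
cluster_pct = 100*N[idx][1]/NN,
cluster_freq = N[idx][1],
record_freq = NN
)
}, by = .(area, score)]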
Breaking down the logic:
# Count how many times each cluster shows up with each area/score
dt[ , .N, keyby = .(area, score, cluster)
][ , {
# Rank each cluster's count within each area/score & take the top;
# ties.method = 'min' guarantees that if there's
# a tie for "winner", _both_ will get rank 1
# (by default, ties.method = 'average')
# Note that it is slightly inefficient to negate N
# in order to sort in descending order, especially
# if there are a large number of groups. We could
# either vectorize negation by using -.N in the
# previous step or by using frankv (a lower-level
# version of frank) which has an 'order' argument
idx = frank(-N, ties.method = 'min') == 1
# calculate here since it's used twice
NN = sum(N)
.(
# use [idx] to subset and make sure there are
# only as many rows on output as there are
# top-ranked clusters for this area/score
cluster_mode = cluster[idx],
cluster_pct = 100*N[idx]/NN,
cluster_freq = N[idx],
record_freq = NN
)
}, by = .(area, score)]
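As a sketch of the frankv alternative mentioned in the comments above (frankv has an order argument, so only the idx line changes and N no longer needs to be negated):
dt_summary_v =
dt[ , .N, keyby = .(area, score, cluster)
][ , {
# order = -1L ranks N in descending order without negating it
idx = frankv(N, order = -1L, ties.method = 'min') == 1
NN = sum(N)
.(
cluster_mode = cluster[idx],
cluster_pct = 100*N[idx]/NN,
cluster_freq = N[idx],
record_freq = NN
)
}, by = .(area, score)]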

Related

Why does cumsum function not work in data.table package when attempting chaining?

I'm trying to add a cumsum() column to a dataframe using the data.table package per the code below, and it doesn't work in this case (see the section marked # this doesn't work, after adding cumsum at the bottom:). I've used cumsum() before, but only as a standalone function in setDT(); now that I'm trying to chain it onto the code, it doesn't work. What am I doing wrong?
Here are the results I get from the code that does work (marked # this works, before cumsum: in the code below), except that I have added a column "cumX" on the right, which is what I tried to add via chaining; "cumX" is a cumulative sum of the column "1stStateX":
Period 1stStateX cumX
1: 1 0 0
2: 2 1 1
3: 3 1 2
4: 4 0 2
5: 5 0 2
In contrast, when I run the code marked # this doesn't work..., I get nothing back in the RStudio console.
Also, when I run the code marked # this works, before cumsum: with the simple sum() function (which works as intended), I noticed that the original dataframe testDF is altered. Why is that, and how do I prevent it from happening?
Code:
library(data.table)
testDF <-
data.frame(
ID = c(rep(10,5),rep(50,5),rep(60,5)),
Period = c(1:5,1:5,1:5),
State = c("A","B","X","X","X",
"A","A","A","A","A",
"A","X","A","X","B")
)
# this works, before cumsum:
setDT(testDF)[
, `1stStateX` := .I == .I[State == 'X'][1],ID][
, .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period]
# this doesn't work, after adding cumsum at the bottom:
setDT(testDF)[
, `1stStateX` := .I == .I[State == 'X'][1],ID][
, .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period][
, cumX := cumsum(`1stStateX`),by = Period]
Final solution, using ismirsehregal's answer: after studying the recommended vignette I got chaining to work:
library(data.table)
testDF <- data.frame(
ID = c(rep(10,5),rep(50,5),rep(60,5)),
Period = c(1:5,1:5,1:5),
State = c("A","B","X","X","X",
"A","A","A","A","A",
"A","X","A","X","B")
)
testDT <- testDF
testDT <- setDT(testDT)[, `1stStateX` := .I == .I[State == 'X'][1],ID][
, .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period][
, cumX := cumsum(`1stStateX`)]
testDT <- as.data.frame(testDT)
print(testDT)
I don't know your expected output; however, I guess your confusion is the result of chaining code that modifies an existing data.table object by reference (please see ?`:=`) with code that returns a new data.table object (please see section "d) Select column(s) in j" of the data.table introduction vignette).
Please check the following:
library(data.table)
testDF <- data.frame(
ID = c(rep(10,5),rep(50,5),rep(60,5)),
Period = c(1:5,1:5,1:5),
State = c("A","B","X","X","X",
"A","A","A","A","A",
"A","X","A","X","B")
)
DT <- copy(testDF)
setDT(DT)
DT[, `1stStateX` := .I == .I[State == 'X'][1], ID] # this step changes DT by reference, please see ?`:=`
aggregatedDT <- DT[, .(`1stStateX` = sum(`1stStateX`, na.rm = TRUE)), by = Period] # this does not change DT and results in another data.table
aggregatedDT[, cumX := cumsum(`1stStateX`)]
print(aggregatedDT)
Result:
Period 1stStateX cumX
1: 1 0 0
2: 2 1 1
3: 3 1 2
4: 4 0 2
5: 5 0 2

Data management: flatten data with R

I have the following dataframe gathering the evolution of policies:
Df <- data.frame(Id_policy = c("A_001", "A_002", "A_003","B_001","B_002"),
date_new = c("20200101","20200115","20200304","20200110","20200215"),
date_end = c("20200503","20200608","20210101","20200403","20200503"),
expend = c("","A_001","A_002","",""))
which looks like this:
Id_policy date_new date_end expend
A_001 20200101 20200503
A_002 20200115 20200608 A_001
A_003 20200304 20210101 A_002
B_001 20200110 20200403
B_002 20200215 20200503
"Id_policy" refers to a specific policy, "date_new" the date of policy issuance, "date_end" the date of policy end. However, sometimes a policy is extended. When it is the case, a new policy is set and the variable "expend" provides the name of the previous policy it changes.
The idea here is to flatten the dataset so we only keep rows corresponding to different policies. So, the output would be something like this:
Id_policy date_new date_end expend
A_001 20200101 20210101
B_001 20200110 20200403
B_002 20200215 20200503
Has someone faced a similar problem?
One way is to treat this as a network problem and use igraph functions (related posts: e.g. Make a group_indices based on several columns; Fast way to group variables based on direct and indirect similarities in multiple columns).
Set the missing 'expend' to 'Id_policy'
Use graph_from_data_frame to create a graph, where 'expend' and 'Id_policy' columns are treated as an edge list.
Use components to get connected components of the graph, i.e. which 'Id_policy' are connected, directly or indirectly.
Select the membership element to get "the cluster id to which each vertex belongs".
Join membership to original data.
Grab relevant data grouped by membership.
I use data.table for the data wrangling steps, but this can of course also be done in base or dplyr.
library(data.table)
library(igraph)
setDT(Df)
Df[expend == "", expend := Id_policy]
g = graph_from_data_frame(Df[ , .(expend, Id_policy)])
mem = components(g)$membership
Df[.(names(mem)), on = .(Id_policy), mem := mem]
Df[ , .(Id_policy = Id_policy[1],
date_new = first(date_new),
date_end = last(date_end)), by = mem]
# mem Id_policy date_new date_end
# 1: 1 A_001 20200101 20210101
# 2: 2 B_001 20200110 20200403
# 3: 3 B_002 20200215 20200503
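As noted above, the wrangling can also be done in dplyr; here is a hedged sketch of the final aggregation step in that style (it assumes Df already carries the mem column added by the join above):
library(dplyr)
Df %>%
group_by(mem) %>%
summarise(Id_policy = first(Id_policy),
date_new = first(date_new),
date_end = last(date_end)) %>%
ungroup()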
Here is a solution using igraph to create a directed network of ids, and data.table to do some binding and joining.
I kept the intermediate results to show what each step does.
library( data.table )
library( igraph )
setDT(Df)
#create nodes and links
nodes <- Df[,1:3]
links <- Df[ !expend == "", .(from = expend, to = Id_policy) ]
g = graph_from_data_frame( links, vertices = nodes, directed = TRUE )
plot(g)
#find nodes without incoming edges (these are the start points of paths)
in.nodes <- V(g)[degree(g, mode = 'in') == 0]
#define subcomponents of the graph by looping over the in.nodes
L <- lapply( in.nodes, function(x) names( subcomponent(g, x) ) )
# $A_001
# [1] "A_001" "A_002" "A_003"
# $B_001
# [1] "B_001"
# $B_002
# [1] "B_002"
L2 <- lapply( L, function(x) {
#get first and last element of each subcomponent
data.table( start = x[1], end = x[ length(x) ] )
})
#bind list together to a single data.table
ans <- rbindlist( L2, use.names = TRUE, fill = TRUE, idcol = "Id_policy" )
# Id_policy start end
# 1: A_001 A_001 A_003
# 2: B_001 B_001 B_001
# 3: B_002 B_002 B_002
#update ans with values from original Df for start and end
ans[ Df, `:=`( start = i.date_new ), on = .(start = Id_policy) ][]
ans[ Df, `:=`( end = i.date_end ), on = .(end = Id_policy) ][]
# Id_policy start end
# 1: A_001 20200101 20210101
# 2: B_001 20200110 20200403
# 3: B_002 20200215 20200503
An outer for loop to go through each policy id in Df, with an inner while loop to find the last extension of an original policy, should work:
Df <- data.frame(Id_policy = c("A_001", "A_002", "A_003","B_001","B_002"),
date_new = c("20200101","20200115","20200304","20200110","20200215"),
date_end = c("20200503","20200608","20210101","20200403","20200503"),
expend = c("","A_001","A_002","",""),
stringsAsFactors = F)
library(dplyr) # needed for bind_rows() below
final_df <- data.frame(matrix(nrow = 0, ncol = 0), stringsAsFactors = F)
for (i in seq_len(nrow(Df))) {
# Check to see if the current policy ID is in the column expend
if (Df$Id_policy[i] %in% Df$expend || !Df$expend[i] == "") {
# Loop through expend policy until last one is found
found_last <- F
j <- i
end_date <- ""
c_policy_id <- Df$Id_policy[j]
expended_id <- Df$Id_policy[which(Df$expend == c_policy_id)]
if (length(expended_id) > 0) {
if (expended_id %in% Df$expend) {
while(!found_last) {
c_policy_id <- Df$Id_policy[j]
expended_id <- Df$Id_policy[which(Df$expend == c_policy_id)]
if (length(expended_id) > 0) {
if (expended_id %in% Df$expend) {
j <- which(Df$expend == expended_id)
}
}else{
end_date <- Df$date_end[j]
found_last <- T
}
}
if (!end_date == "") {
# Add to final df when found the last one
final_df <- bind_rows(final_df, data.frame(Id_policy = Df$Id_policy[i],
date_new = Df$date_new[i],
date_end = end_date,
expend = ""))
}
}
}
}else{
final_df <- bind_rows(final_df, Df[i, ])
}
}
final_df
Id_policy date_new date_end expend
1 A_001 20200101 20210101
2 B_001 20200110 20200403
3 B_002 20200215 20200503

dplyr into data.table: filter > group by > count

I usually work with dplyr, but I am facing a rather large data set and my approach is very slow. I basically need to filter a data frame, group it by dates, and count the occurrences within each group.
Sample data (I have already turned everything into data.table):
library(data.table)
library(dplyr)
set.seed(123)
df <- data.table(startmonth = seq(as.Date("2014-07-01"),as.Date("2014-11-01"),by="months"),
endmonth = seq(as.Date("2014-08-01"),as.Date("2014-12-01"),by="months")-1)
df2 <- data.table(id = sample(1:10, 5, replace = T),
start = sample(seq(as.Date("2014-07-01"),as.Date("2014-10-01"),by="days"),5),
end = df$startmonth + sample(10:90,5, replace = T)
)
#cross joining
res <- setkey(df2[,c(k=1,.SD)],k)[df[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
My dplyr approach works but is slow
res %>% filter(start <=endmonth & end>= startmonth) %>%
group_by(startmonth,endmonth) %>%
summarise(countmonth=n())
My data.table knowledge is limited, but I guess we would setkey() on the date columns and do something like res[, `:=`(COUNT = .N, IDX = 1:.N), by = .(startmonth, endmonth)] to get the counts by group, but I'm not sure how the filter fits in there.
Appreciate your help!
You could do the counting inside the join:
df2[df, on=.(start <= endmonth, end >= startmonth), allow.cartesian=TRUE, .N, by=.EACHI]
start end N
1: 2014-07-31 2014-07-01 1
2: 2014-08-31 2014-08-01 4
3: 2014-09-30 2014-09-01 5
4: 2014-10-31 2014-10-01 3
5: 2014-11-30 2014-11-01 3
or add it as a new column in df:
df[, n :=
df2[.SD, on=.(start <= endmonth, end >= startmonth), allow.cartesian=TRUE, .N, by=.EACHI]$N
]
How it works: the syntax is x[i, on=, allow.cartesian=, j, by=.EACHI]. Each row of i is used to look up matching rows in x. The symbol .EACHI indicates that the aggregation (j=.N) will be done for each row of i.
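One thing worth noting (a sketch, not part of the original answer): in this non-equi join the result's join columns take df2's names (start, end) but hold df's values (endmonth and startmonth, as the printed output above shows), so renaming them afterwards can make the result easier to read. The name counts is just illustrative.
counts <- df2[df, on = .(start <= endmonth, end >= startmonth),
allow.cartesian = TRUE, .N, by = .EACHI]
# rename the join columns to reflect the df values they actually contain
setnames(counts, c("start", "end"), c("endmonth", "startmonth"))
counts[]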

Find nearest value by group

I am looking for a practical way to retrieve the nearest value to 0 for each group using (preferably) data.table.
Assume the following DT:
set.seed(1)
library(data.table)
DT <- data.table(val = rnorm(1000), group = rep(1:10, each = 100)) # 10 groups
I have tried to combine by = group with roll = "nearest", but it only returns the nearest value across the whole table and not by group:
DT[val == 0, val, by = group, roll = "nearest"]
# group value
#1: 8 0.001105352
I could of course repeat the process for each group, but it would be impractical as the number of groups increases. E.g.:
res <- rbind(DT[val == 0 & group == 1, val, by = group, roll = "nearest"],
DT[val == 0 & group == 2, val, by = group, roll = "nearest"],
DT[val == 0 & group == 3, val, by = group, roll = "nearest"],
...)
Maybe I am missing some data.table feature?
You don't necessarily need a join for that.
A possible solution using a combination of min and abs:
DT[, .(closest.val.to.zero = val[abs(val) == min(abs(val))]), by = group]
which gives:
group closest.val.to.zero
1: 1 0.011292688
2: 2 -0.016190263
3: 3 0.002131860
4: 4 0.004398704
5: 5 0.017395620
6: 6 0.002415809
7: 7 0.004884450
8: 8 0.001105352
9: 9 -0.040150452
10: 10 -0.010925691
A more generalised way of the option posted by @chinsoon12 in the comments:
DT[CJ(group = group, val = 0, unique = TRUE)
, on = .(group, val)
, .(group, closest.val.to.zero = x.val)
, roll = "nearest"]
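As an alternative sketch (not from the answers above), which.min() can also be used; it always returns a single row per group, keeping the first match in the case of ties:
# which.min(abs(val)) gives the index of the value closest to zero per group
DT[, .(closest.val.to.zero = val[which.min(abs(val))]), by = group]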

Join 2 data frames using data.table with conditions

I have these two data frames:
set.seed(42)
A <- data.table(station = sample(1:10, 1000, replace=TRUE),
hash = sample(letters[1:5], 1000, replace=TRUE),
point = sample(1:24, 1000, replace=TRUE))
B <- data.table(station = sample(1:10, 100, replace=TRUE),
card = sample(letters[6:10], 100, replace=TRUE),
point = sample(1:24, 100, replace=TRUE))
Dataframe A contains more than 1M rows.
I am trying to find the hash (from A) for each card (from B), subject to some conditions: station and point in A must lie in a range around B's values (for station, +/- 1; for point, just + 2).
I group B by card and, for each group, run a function that binds together the rows of A satisfying these conditions and then takes the most frequent hash.
detect <- function(x){
am0 <- data.frame(station = 0,
hash = 0,
point = 0)
for (i in 1:nrow(x)) {
am1 <- A %>%
filter(station %in% (B$station[i] - 1) : (B$station[i] + 1) &
point > B$point[i] & point < B$point[i] + 2)
am0 <- rbind(am0, am1)
}
t <- as.data.frame(table(am0$hash))
t <- t %>%
arrange(-Freq) %>%
filter(row_number() == 1)
return(t)
}
And then just:
library(dplyr)
B %>%
group_by(card) %>%
do(detect(.)) %>%
ungroup
But I don't know how to make the function index the current group's rows with [i], so I actually get a wrong result:
# A tibble: 5 x 3
card Var1 Freq
<chr> <fctr> <int>
1 f c 46
2 g c 75
3 h c 41
4 i c 64
5 j c 62
I'm a beginner, but I know the best solution for big datasets like these is to join them with the data.table library. Can you help me find one?
I think what you want to do is:
#### Prepare join limits
B[, point_limit := as.integer(point + 2)]
B[, station_lower := as.integer(station - 1)]
B[, station_upper := as.integer(station + 1)]
## Join A on B, creates All combinations of points in A and B fulfilling the conditions
joined_table <- B[A,
, on = .( point_limit >= point, point <= point,
station_lower <= station, station_upper >= station),
nomatch = 0,
allow.cartesian=TRUE]
## Count the occurrences of the combinations
counted_table <- joined_table[,.N, by=.(card,hash)][order(card, -N)]
## Select the top for each group.
counted_table[, head(.SD, 1 ),by = .(card)][order(card)]
This will create a full table with all the information in it and then do the counting on that. It relies purely on data.table in order to fully take advantage of the speed gains from that package. The data.table vignette is good if you are unfamiliar with the syntax. The nomatch = 0 condition ensures that we are doing an inner join.
This will probably be fine if A is only 1M rows and B is kept the same size, depending on your data's distribution. We can, however, also split B in a similar way to your do statement using the purrr package. I'm not sure how this interacts with R's garbage collection, though.
frame_list <- purrr::map(unique(B$card),
~ B[card == .x][A,
, on = .(point_limit >= point,
point <= point,
station_lower <= station,
station_upper >= station),
nomatch = 0,
allow.cartesian = TRUE][, .N, by = .(card, hash)])
counted_table_mem <- rbindlist(frame_list )
Something to note here is that I use rbindlist instead of multiple rbind calls. Repeatedly calling rbind would be very slow, since new memory has to be allocated each time.
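A sketch of the remaining step, mirroring the non-split version above: since B was split by card, each (card, hash) pair occurs in exactly one chunk, so the most frequent hash per card can be picked directly from the bound result.
# order by card and descending N, then keep the top row per card
counted_table_mem[order(card, -N), head(.SD, 1), by = card]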
