mutate() with an if/else function

mutate() with an if/else function - r

I have an example dataframe
df <- data.frame(cust = sample(1:100, 1000, TRUE),
channel = sample(c("WEB", "POS"), 1000, TRUE))
that I'm trying to mutate
get_channels <- function(data) {
d <- data
if(unique(d) %>% length() == 2){
d <- "Both"
} else {
if(unique(d) %>% length() < 2 && unique(d) == "WEB") {
d <- "Web"
} else {
d <- "POS"
}
}
return(d)
}
This works without issue and on small dataframes, it takes no time at all.
start.time <- Sys.time()
df %>%
group_by(cust) %>%
mutate(chan = get_channels(channel)) %>%
group_by(cust) %>%
slice(1) %>%
group_by(chan) %>%
summarize(count = n()) %>%
mutate(perc = count/sum(count))
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 0.34602 secs
However, when the data frame gets rather large, say, on the order of >1000000 or more cust, my basic if/else fx takes much, much longer.
How can I streamline this function to make it run more quickly?

You should use a data.table for this.
setDT(df)
t1 = Sys.time()
df = df[ , .(channels = ifelse(uniqueN(channel) == 2, "both", as.character(channel[1]))), by = .(cust)]
> Sys.time() - t1
Time difference of 0.00500083 secs
> head(df)
cust channels
1: 37 both
2: 45 both
3: 74 both
4: 20 both
5: 1 both
6: 68 both

You can do it in base R using something like that:
web_cust <- unique(df$cust[df$channel=="WEB"])
pos_cust <- unique(df$cust[df$channel=="POS"])
both <- length(intersect(web_cust, pos_cust))
web_only <- length(setdiff(web_cust, pos_cust))
pos_only <- length(setdiff(pos_cust, web_cust))
Data:
set.seed(1)
df <- data.frame(cust = sample(2e6, 1e7, TRUE),
channel = sample(c("WEB", "POS"), 1e7, TRUE),
stringsAsFactors = F)

A faster dplyr version that takes about 1/3 the time but is probably still slower than the data table version. uniqueN borrowed from #Kristoferson answer.
df %>%
group_by(cust) %>%
summarize(chan = if_else(uniqueN(channel) == 2, "Both", as.character(channel[1]))) %>%
group_by(chan) %>%
summarize(n = n() ) %>%
mutate(perc = n /sum(n))
Also, your orginal can be improved significantly by optimizing your function like this:
get_channels <- function(data) {
ud <- unique(data)
udl <- length(ud)
if(udl == 2) {
r <- "Both"
} else {
if(udl < 2 && ud == "WEB") {
r <- "Web"
} else {
r <- "POS"
}
}
return(r)
}

And some timings...
I tried three different alternatives in both dplyr and data.table: (1) ifelse (see #Kristofersen's answer), (2) if / else (because the test is of length 1), and (3) vector indexing. Unsurprisingly, the main difference is between dplyr and data.table and not among alternative 1-3.
For 1000 customers, data.table is about 7 times faster. For 10000 customers it's about 30 times faster. For 1e6 customers, I only tested data.table, not a very large difference between alternatives.
# 1000 customers, 2*1000 registrations
df <- data.frame(cust = sample(1e3, 2e3, replace = TRUE),
channel = sample(c("WEB", "POS"), 2e3, TRUE))
library(microbenchmark)
library(dplyr)
library(data.table)
microbenchmark(dp1 = df %>%
group_by(cust) %>%
summarise(res = ifelse(n_distinct(channel) == 1, channel[1], "both")),
dp2 = df %>%
group_by(cust) %>%
summarise(res = if(n_distinct(channel) == 1) channel[1] else "both"),
dp3 = df %>%
group_by(cust) %>%
summarise(res = c("both", channel[1])[(n_distinct(channel) == 1) + 1]),
dt1 = setDT(df)[ , .(channels = ifelse(uniqueN(channel) == 2, "both", channel[1])), by = cust],
dt2 = setDT(df)[ , .(channels = if(uniqueN(channel) == 2) "both" else channel[1]), by = cust],
dt3 = setDT(df)[ , .(res = c("both", channel[1])[(uniqueN(channel) == 1) + 1]), by = cust],
times = 5, unit = "relative")
# 1e3 customers
# Unit: relative
# expr min lq mean median uq max neval
# dp1 7.8985477 8.176139 7.9355234 7.676534 8.0359975 7.9166933 5
# dp2 7.8882707 8.018000 7.8965098 8.731935 7.8414478 7.3560530 5
# dp3 8.0851402 8.934831 7.7540060 7.653026 6.8305012 7.6887950 5
# dt1 1.1713088 1.180870 1.0350482 1.209861 1.0523597 0.7650059 5
# dt2 0.8272681 1.223387 0.9311628 1.047773 0.9028017 0.7795579 5
# dt3 1.0000000 1.000000 1.0000000 1.000000 1.0000000 1.0000000 5
# 1e4 customers
# Unit: relative
# expr min lq mean median uq max neval
# dp1 40.8725204 39.5297108 29.5755838 38.996075 38.246103 17.2784642 5
# dp2 40.7396141 39.4299918 27.4476811 38.819577 37.886320 12.7265756 5
# dp3 41.0940358 39.7819673 27.5532964 39.260488 38.317899 12.4685386 5
# dt1 1.0905470 1.0661613 0.7422082 1.053786 1.034642 0.3428945 5
# dt2 0.9052739 0.9008761 1.2813458 2.111642 2.356008 0.9005391 5
# dt3 1.0000000 1.0000000 1.0000000 1.000000 1.000000 1.0000000 5
# 1e6 customers, data.table only
# Unit: relative
# expr min lq mean median uq max neval
# dt1 1.146757 1.147152 1.155497 1.164471 1.156244 1.161660 5
# dt2 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 5
# dt3 1.084442 1.079734 1.253568 1.106833 1.098766 1.799935 5

Related

Sum over 7 day rolling window in R [duplicate]

I want to get the rolling 7-day sum by ID. Suppose my data looks like this:
data<-as.data.frame(matrix(NA,42,3))
data$V1<-seq(as.Date("2014-05-01"),as.Date("2014-09-01"),by=3)
data$V2<-rep(1:6,7)
data$V3<-rep(c(1,2),21)
colnames(data)<-c("Date","USD","ID")
Date USD ID
1 2014-05-01 1 1
2 2014-05-04 2 2
3 2014-05-07 3 1
4 2014-05-10 4 2
5 2014-05-13 5 1
6 2014-05-16 6 2
7 2014-05-19 1 1
8 2014-05-22 2 2
9 2014-05-25 3 1
10 2014-05-28 4 2
How can I add a new column that will contain the rolling 7-day sum by ID?

If your data is big, you might want to check out this solution which uses data.table. It is pretty fast. If you need more speed, you can always change mapply to mcmapply and use multiple cores.
#Load data.table and convert to data.table object
require(data.table)
setDT(data)[,ID2:=.GRP,by=c("ID")]
#Build reference table
Ref <- data[,list(Compare_Value=list(I(USD)),Compare_Date=list(I(Date))), by=c("ID2")]
#Use mapply to get last seven days of value by id
data[,Roll.Val := mapply(RD = Date,NUM=ID2, function(RD, NUM) {
d <- as.numeric(Ref$Compare_Date[[NUM]] - RD)
sum((d <= 0 & d >= -7)*Ref$Compare_Value[[NUM]])})]

Dataset provided by OP does not expose the complexity of the task. In terms of addressing OP question so far only Mike's answer was the correct one.
In fact for a 8 rolling days, instead of 7 rolling days, due to d <= 0 & d >= -7.
zoo solution by #G. Grothendieck is almost valid, only if merge would be made to each group of ID.
Below second data.table solution, this time valid results, using dev RcppRoll which allows na.rm=TRUE.
And slightly formatted Mike's solution output.
data<-as.data.frame(matrix(NA,42,3))
data$V1<-seq(as.Date("2014-05-01"),as.Date("2014-09-01"),by=3)
data$V2<-rep(1:6,7)
data$V3<-rep(c(1,2),21)
colnames(data)<-c("Date","USD","ID")
library(microbenchmark)
library(RcppRoll) # install_github("kevinushey/RcppRoll")
library(data.table) # install_github("Rdatatable/data.table")
correct_jan_dt = function(n, partial=TRUE){
DT = as.data.table(data) # this can be speedup by setDT()
date.range = DT[,range(Date)]
all.dates = seq.Date(date.range[1],date.range[2],by=1)
setkey(DT,ID,Date)
r = DT[CJ(unique(ID),all.dates)][, c("roll") := as.integer(roll_sumr(USD, n, normalize = FALSE, na.rm = TRUE)), by="ID"][!is.na(USD)]
# This could be simplified when `partial` arg will be implemented in [kevinushey/RcppRoll](https://github.com/kevinushey/RcppRoll)
if(isTRUE(partial)){
r[is.na(roll), roll := cumsum(USD), by="ID"][]
}
return(r[order(Date,ID)])
}
correct_mike_dt = function(){
data = as.data.table(data)[,ID2:=.GRP,by=c("ID")]
#Build reference table
Ref <- data[,list(Compare_Value=list(I(USD)),Compare_Date=list(I(Date))), by=c("ID2")]
#Use mapply to get last seven days of value by id
data[, c("roll") := mapply(RD = Date,NUM=ID2, function(RD, NUM){
d <- as.numeric(Ref$Compare_Date[[NUM]] - RD)
sum((d <= 0 & d >= -7)*Ref$Compare_Value[[NUM]])})][,ID2:=NULL][]
}
identical(correct_mike_dt(), correct_jan_dt(n=8,partial=TRUE))
# [1] TRUE
microbenchmark(unit="relative", times=5L, correct_mike_dt(), correct_jan_dt(8))
# Unit: relative
# expr min lq mean median uq max neval
# correct_mike_dt() 274.0699 273.9892 267.2886 266.6009 266.2254 256.7296 5
# correct_jan_dt(8) 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 5
Looking forward for update from #Khashaa.
Edit (20150122.2): Below benchmarks do not answer OP question.
Timing on a bigger (still very tiny) dataset, 5439 rows:
library(zoo)
library(data.table)
library(dplyr)
library(RcppRoll)
library(microbenchmark)
data<-as.data.frame(matrix(NA,5439,3))
data$V1<-seq(as.Date("1970-01-01"),as.Date("2014-09-01"),by=3)
data$V2<-sample(1:6,5439,TRUE)
data$V3<-sample(c(1,2),5439,TRUE)
colnames(data)<-c("Date","USD","ID")
zoo_f = function(){
z <- read.zoo(data)
z0 <- merge(z, zoo(, seq(start(z), end(z), "day")), fill = 0) # expand to daily
roll <- function(x) rollsumr(x, 7, fill = NA)
transform(data, roll = ave(z0$USD, z0$ID, FUN = roll)[time(z)])
}
dt_f = function(){
DT = as.data.table(data) # this can be speedup by setDT()
date.range = DT[,range(Date)]
all.dates = seq.Date(date.range[1],date.range[2],by=1)
setkey(DT,Date)
DT[.(all.dates)
][order(Date), c("roll") := rowSums(setDT(shift(USD, 0:6, NA, "lag")),na.rm=FALSE), by="ID"
][!is.na(ID)]
}
dp_f = function(){
data %>% group_by(ID) %>%
mutate(roll=roll_sum(c(rep(NA,6), USD), 7))
}
dt2_f = function(){
# this can be speedup by setDT()
as.data.table(data)[, c("roll") := roll_sum(c(rep(NA,6), USD), 7), by="ID"][]
}
identical(as.data.table(zoo_f()),dt_f())
# [1] TRUE
identical(setDT(as.data.frame(dp_f())),dt_f())
# [1] TRUE
identical(dt2_f(),dt_f())
# [1] TRUE
microbenchmark(unit="relative", times=20L, zoo_f(), dt_f(), dp_f(), dt2_f())
# Unit: relative
# expr min lq mean median uq max neval
# zoo_f() 140.331889 141.891917 138.064126 139.381336 136.029019 137.730171 20
# dt_f() 14.917166 14.464199 15.210757 16.898931 16.543811 14.221987 20
# dp_f() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20
# dt2_f() 1.536896 1.521983 1.500392 1.518641 1.629916 1.337903 20
Yet I'm not sure if my data.table code is already optimal.
Above functions did not answer OP question. Read the top of post for update. Mike's solution was the correct one.

1) Assuming you mean every successive overlapping 7 rows for that ID:
library(zoo)
transform(data, roll = ave(USD, ID, FUN = function(x) rollsumr(x, 7, fill = NA)))
2) If you really did mean 7 days and not 7 rows then try this:
library(zoo)
z <- read.zoo(data)
z0 <- merge(z, zoo(, seq(start(z), end(z), "day")), fill = 0) # expand to daily
roll <- function(x) rollsumr(x, 7, fill = NA)
transform(data, roll = ave(z0$USD, z0$ID, FUN = roll)[time(z)])
Updated Added (2) and made some improvements.

library(data.table)
data <- data.table(Date = seq(as.Date("2014-05-01"),
as.Date("2014-09-01"),
by = 3),
USD = rep(1:6, 7),
ID = rep(c(1, 2), 21))
data[, Rolling7DaySum := {
d <- data$Date - Date
sum(data$USD[ID == data$ID & d <= 0 & d >= -7])
},
by = list(Date, ID)]

I found that there is some problem with Mike.Gahan's suggested code and correct it as below after testing it out.
require(data.table)
setDT(data)[,ID2:=.GRP,by=c("ID")]
Ref <-data[,list(Compare_Value=list(I(USD)),Compare_Date=list(I(Date))),by=c("ID2")]
data[,Roll.Val := mapply(RD = Date,NUM=ID2, function(RD, NUM) {
d <- as.numeric(Ref[ID2 == NUM,]$Compare_Date[[1]] - RD)
sum((d <= 0 & d >= -7)*Ref[ID2 == NUM,]$Compare_Value[[1]])})]

Does Column of Sentences Contain Word in Another Column of Sentences?

I have two large tables each containing a "sentence" column with a string of words. I am curious which records (true/false output) have a word that is found in any of the sentences in either column. My tables are very large and the below code I have can take a very long time. Is there a faster way to go about doing this?
Thank you!
# Determine if any "words" in either column of sentences match.
# Packages
library(tidyverse)
# Help functions
helper_in_2 <- function(b, a){
return(any(b %in% a))
}
helper_in <- function(a, b){
return(lapply(b, helper_in_2, a))
}
# Sample columns
sentence_col_a <- c("This is an example sentence.", "Here is another sample sentence?", "One more sentence that is not complicated.", "Last sentence to show an example!")
sentence_col_b <- c("Short string A.", "Another longer string.", "Final string example!")
# Extract words from each column
list_col_a <- str_to_lower(sentence_col_a) %>%
str_extract_all("[:alpha:]+")
list_col_b <- str_to_lower(sentence_col_b) %>%
str_extract_all("[:alpha:]+")
# Check for matches.
# (Code after first line isn't actually used in my code - it's just to show matches)
sapply(lapply(list_col_a, helper_in, list_col_b), as.numeric) %>%
t() %>%
as.data.frame() %>%
rename_at(vars(names(.)), function(x) sentence_col_b) %>%
mutate(rownames = sentence_col_a) %>%
tibble::column_to_rownames(var = "rownames")
Output:
Sentences
Short string A.
Another longer string.
Final string example!
This is an example sentence.
0
0
1
Here is another sample sentence?
0
1
0
One more sentence that is not complicated.
0
0
0
Last sentence to show an example!
0
0
1
Update after Ronak's Answer
library(microbenchmark)
microbenchmark("Original method:" = sapply(lapply(list_col_a, helper_in, list_col_b), as.numeric),
"Ronak's method:" = sapply(list_col_a, function(x) as.integer(grepl(sprintf('\\b(%s)\\b', paste0(x, collapse = '|')), list_col_b))))
#Unit: microseconds
# expr min lq mean median uq max neval
#Original method: 72.9 76.65 88.082 82.35 86.1 173.9 100
# Ronak's method: 262.1 277.40 354.741 286.40 348.6 3724.3 100

Here I can provide several options, but the nested for-loop method might be the most efficient one so far:
outer
TIC1 <- function() {
+outer(list_col_a, list_col_b, FUN = Vectorize(function(x, y) any(x %in% y)))
}
nested sapply
TIC2 <- function() {
sapply(
list_col_b,
function(x) {
sapply(
list_col_a,
function(y) sum(y %in% x)
)
}
)
}
nested for loops
TIC3 <- function() {
res <- matrix(nrow = length(list_col_a), ncol = length(list_col_b))
for (a in seq_along(list_col_a)) {
for (b in seq_along(list_col_b)) {
res[a, b] <- any(list_col_a[[a]] %in% list_col_b[[b]])
}
}
+res
}
Benchmarking
# Original solution
original <- function() {
sapply(lapply(list_col_a, helper_in, list_col_b), as.numeric) %>%
t() %>%
as.data.frame() %>%
rename_at(vars(names(.)), function(x) sentence_col_b) %>%
mutate(rownames = sentence_col_a) %>%
tibble::column_to_rownames(var = "rownames")
}
# Waldi's data.table solution
Waldi <- function() {
la <- data.table(id = 1:length(list_col_a), list_col_a)
lb <- data.table(id = 1:length(list_col_b), list_col_b)
la_long <- la[, .(words = unlist(list_col_a)), by = id]
lb_long <- lb[, .(words = unlist(list_col_b)), by = id]
unique(la_long[lb_long, on = .(words = words)][!is.na(id), .(idxa = id, idxb = i.id)])
}
mustafaakben1 <- function(rows = list_col_a, cols = list_col_b) {
to_matrix <- function(X_t) {
matrix(unlist(X_t),
nrow = length(list_col_a),
ncol = length(list_col_b),
byrow = T
)
}
to_matrix(lapply(
1:length(cols),
FUN = function(X) {
lapply(
X = 1:length(rows),
FUN = function(Y) {
sum(rows[[Y]] %in% cols[[X]])
}
)
}
))
}
library(fastmatch)
mustafaakben2 <- function() {
search_keywords <- unlist(list_col_b)[unlist(list_col_b) %in% unlist(list_col_a)]
b_col_filter <- which(unlist(lapply(list_col_b, function(X) any(X %in% search_keywords))))
a_row_filter <- which(unlist(lapply(list_col_a, function(X) any(X %in% search_keywords))))
res <- matrix(0,
nrow = length(list_col_a),
ncol = length(list_col_b)
)
for (a in a_row_filter) {
for (b in b_col_filter) {
res[a, b] <- any(list_col_a[[a]] %fin% list_col_b[[b]])
}
}
+res
}
# ThomasIsCoding's outer solution
TIC1 <- function() {
+outer(list_col_a, list_col_b, FUN = Vectorize(function(x, y) any(x %in% y)))
}
TIC2 <- function() {
sapply(
list_col_b,
function(x) {
sapply(
list_col_a,
function(y) sum(y %in% x)
)
}
)
}
TIC3 <- function() {
res <- matrix(nrow = length(list_col_a), ncol = length(list_col_b))
for (a in seq_along(list_col_a)) {
for (b in seq_along(list_col_b)) {
res[a, b] <- any(list_col_a[[a]] %in% list_col_b[[b]])
}
}
+res
}
microbenchmark::microbenchmark(
original(),
Waldi(),
mustafaakben1(),
mustafaakben2(),
TIC1(),
TIC2(),
TIC3(),
unit = "relative"
)
and you will see
Unit: relative
expr min lq mean median uq max
original() 172.895884 149.346066 49.841448 142.676077 134.111459 3.130206
Waldi() 107.441122 92.004380 30.290680 88.474026 83.690267 1.971249
mustafaakben1() 1.596981 1.551978 1.646884 1.610160 1.553021 1.683034
mustafaakben2() 1.635812 1.731991 2.186106 1.912535 1.831179 2.332797
TIC1() 3.854043 3.845866 1.977066 3.943445 3.707308 1.041416
TIC2() 2.888118 2.627955 1.607401 2.719427 2.538536 1.142211
TIC3() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
neval
100
100
100
100
100
100
100

Try nested lappy
to_matrix <- function(X_t){
matrix(unlist(X_t),
nrow = length(list_col_a),
ncol = length(list_col_b)),
byrow = T)
}
nested_lappy <- function(rows=list_col_a, cols=list_col_b) {
to_matrix(lapply(
1:length(cols),
FUN = function (X)
lapply(
X = 1:length(rows),
FUN = function(Y)
sum(rows[[Y]] %in% cols[[X]])
)
))
}
> nested_lappy()
[,1] [,2] [,3]
[1,] 0 0 1
[2,] 0 1 0
[3,] 0 0 0
[4,] 0 0 1
Probably your matrix will be huge too. It would be better to use a sparse matrix. You can try to use the Matrix package. It may help you to carry out your analysis in a more memory-efficient way.
Here is the benchmark
microbenchmark::microbenchmark(
original(),
Waldi(),
TIC(),
nested_lappy(),
unit = "relative"
)
Unit: relative
expr min lq mean median uq max neval
original() 99.97881 89.869163 83.011249 67.88434 69.883301 260.704657 100
Waldi() 56.55076 51.185905 45.436361 39.35327 42.730942 46.438114 100
TIC() 2.27000 2.249311 1.986625 1.84108 1.837013 3.974149 100
nested_lappy() 1.00000 1.000000 1.000000 1.00000 1.000000 1.000000 100
Edits
I will cheat here a little bit because #ThomasIsCoding is an amazing coder. I need to cheat :)
So, because you have a huge table, you need to focus on an efficient way to search your keyword space. As you may notice that not all keywords have an intersection and shared uniformly in the sentences. So, even before starting to search, you can eliminate those sentences in the search space. By doing so, we can focus only on the words shared by both column and row dimensions.
search_keywords<- unlist(list_col_b)[unlist(list_col_b) %in% unlist(list_col_a)]
b_col_filter <- which(unlist(lapply(list_col_b, function(X) any(X %in% search_keywords))))
a_row_filter <- which(unlist(lapply(list_col_a, function(X) any(X %in% search_keywords))))
Then, use the fastmatch package to make the %in% faster.
library(fastmatch)
mustafaakben2 <- function() {
res <- matrix(0,
nrow = length(list_col_a),
ncol = length(list_col_b))
for (a in a_row_filter) {
for (b in b_col_filter) {
res[a, b] <- any(list_col_a[[a]] %fin% list_col_b[[b]])
}
}
+res
}
> mustafaakben2()
[,1] [,2] [,3]
[1,] 0 0 1
[2,] 0 1 0
[3,] 0 0 0
[4,] 0 0 1
Benchmark results
microbenchmark::microbenchmark(
original(),
Waldi(),
TIC1(),
TIC2(),
TIC3(),
mustafaakben(),
mustafaakben2(),
unit = "relative"
)
Unit: relative
expr min lq mean median uq max neval cld
original() 288.620155 254.429012 193.446439 190.457965 171.914286 115.0415822 100 c
Waldi() 182.751938 153.864198 115.182908 115.778761 103.518095 36.9411765 100 b
TIC1() 6.581395 6.277778 5.074523 5.066372 4.685714 2.3732252 100 a
TIC2() 4.705426 4.385802 3.503269 3.466814 3.281905 1.5811359 100 a
TIC3() 1.767442 1.685185 1.360847 1.338496 1.249524 0.7728195 100 a
mustafaakben() 2.589147 2.330247 1.944260 2.017699 1.864762 0.7322515 100 a
mustafaakben2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100 a

You could use a data.table join to get the id of the matching sentences.
library(data.table)
# Original solution
original <- function(){
sapply(lapply(list_col_a, helper_in, list_col_b), as.numeric) %>%
t() %>%
as.data.frame() %>%
rename_at(vars(names(.)), function(x) sentence_col_b) %>%
mutate(rownames = sentence_col_a) %>%
tibble::column_to_rownames(var = "rownames")
}
# data.table solution
new <- function(){
la <- data.table(id = 1:length(list_col_a),list_col_a)
lb <- data.table(id = 1:length(list_col_b),list_col_b)
la_long <- la[,.(words=unlist(list_col_a)),by= id]
lb_long <- lb[,.(words=unlist(list_col_b)),by= id]
unique(la_long[lb_long, on=.(words=words)][!is.na(id),.(idxa=id, idxb = i.id)])
}
new()
idxa idxb
1: 2 2
2: 1 3
3: 4 3
microbenchmark::microbenchmark(original(),new())
Unit: milliseconds
expr min lq mean median uq max neval cld
original() 4.1623 5.1190 5.857155 5.5528 6.18345 23.5442 100 b
new() 2.2492 2.7993 3.255741 3.1298 3.68645 5.1872 100 a
As data.table allows indexing, this could be much more efficient on a higher number of sentences / words : to be tested on a bigger dataset.

With the help of regular expressions, you can do this with one sapply call. We create a pattern with each value in list_col_a and check if any of it exists in list_col_b.
sapply(list_col_a, function(x) as.integer(grepl(sprintf('\\b(%s)\\b',
paste0(x, collapse = '|')), list_col_b)))
# [,1] [,2] [,3] [,4]
#[1,] 0 0 0 0
#[2,] 0 1 0 0
#[3,] 1 0 0 1
You can include your remaining code as it is to get the matches.

R - Calculating differences by group for all cuts of data

I have a dataset with several attributes and a value.
Input (sample)
GRP CAT TYP VAL
X H 5 0.76
X A 2 0.34
X D 3 0.70
X I 3 0.33
X F 4 0.80
X E 1 0.39
I want to:
Determine all combinations of CAT and TYP
For each combination, calculate the average value when the combination is removed
Return a final table of differences
Final Table (sample)
CAT TYP DIFF
1 <NA> NA 0.04000
2 H NA 0.03206
Row 1 means that if no records are removed, the difference between the average value of GRP='X' and GRP='Y' is 0.04. Row 2 means that if records with CAT='H' are removed, the difference is 0.032.
I have working code, but I want to make it faster. I'm open to your suggestions.
Working Code
library(dplyr)
set.seed(777)
# build example data frame
df <- data.frame(GRP = c(rep('X',25),rep('Y',25)),
CAT = sample(LETTERS[1:10], 50, T),
TYP = sample(1:5, 50, T),
VAL = sample(1:100, 50, T)/100,
stringsAsFactors = F)
# table of all combinations of CAT and TYP
splits <- expand.grid(lapply(df[,-c(1,4)], function(x) c(NA, unique(x))), stringsAsFactors = F)
# null data frame to store results
ans <- data.frame(CAT = character(),
TYP = integer(),
DIFF = numeric(),
stringsAsFactors = F)
# loop through each combination and calculate the difference between group X and Y
for(i in 1:nrow(splits)) {
split.i <- splits[i,]
# determine non-na columns
by.cols <- colnames(split.i)[unlist(lapply(split.i, function(x) !all(is.na(x))))]
# anti-join to remove records that match `split.i`
if(length(by.cols) > 0){
df.i <- df %>%
anti_join(split.i, by = by.cols)
} else {
df.i <- df
}
# calculate average by group
df.i <- df.i %>%
group_by(GRP) %>%
summarize(VAL_MEAN = mean(VAL))
# calculate difference of averages
DIFF <- df.i[,2] %>%
as.matrix() %>%
diff() %>%
as.numeric()
ans.tmp <- cbind(split.i, DIFF)
# bind to final data frame
ans <- bind_rows(ans, ans.tmp)
}
return(ans)
Speed results
> system.time(fcnDiffCalc())
user system elapsed
0.30 0.02 0.31

Consider assigning DIFF column with sapply rather than growing a data frame in a loop to avoid the repetitive in-memory copying:
fcnDiffCalc2 <- function() {
# table of all combinations of CAT and TYP
splits <- data.frame(expand.grid(lapply(df[,-c(1,4)], function(x) c(NA, unique(x))),
stringsAsFactors = F))
# loop through each combination and calculate the difference between group X and Y
splits$DIFF <- sapply(1:nrow(splits), function(i) {
split.i <- splits[i,]
# determine non-na columns
by.cols <- colnames(split.i)[unlist(lapply(split.i, function(x) !all(is.na(x))))]
# anti-join to remove records that match `split.i`
df.i <- tryCatch(df %>%
anti_join(split.i, by = by.cols), error = function(e) df)
# calculate average by group
df.i <- df.i %>%
group_by(GRP) %>%
summarize(VAL_MEAN = mean(VAL))
# calculate difference of averages
DIFF <- df.i[,2] %>%
as.matrix() %>%
diff() %>%
as.numeric()
})
return(splits)
}
Even better, avoid the loop in expand.grid, use vapply over sapply (even the unlist + lapply = sapply or vapply) defining the outcome structure, and avoid pipes in loop to revert to base R's aggregate:
fcnDiffCalc3 <- function() {
# table of all combinations of CAT and TYP
splits <- data.frame(expand.grid(CAT = c(NA, unique(df$CAT)), TYP = c(NA, unique(df$TYP)),
stringsAsFactors = FALSE))
# loop through each combination and calculate the difference between group X and Y
splits$DIFF <- vapply(1:nrow(splits), function(i) {
split.i <- splits[i,]
# determine non-na columns
by.cols <- colnames(split.i)[vapply(split.i, function(x) !all(is.na(x)), logical(1))]
# anti-join to remove records that match `split.i`
df.i <- tryCatch(anti_join(df, split.i, by = by.cols), error = function(e) df)
# calculate average by group
df.i <- aggregate(VAL ~ GRP, df.i, mean)
# calculate difference of averages
diff(df.i$VAL)
}, numeric(1))
return(splits)
}
Output
df_op <- fcnDiffCalc()
df_new <- fcnDiffCalc2()
df_new2 <- fcnDiffCalc3()
identical(df_op, df_new)
# [1] TRUE
identical(df_op, df_new2)
# [1] TRUE
library(microbenchmark)
microbenchmark(fcnDiffCalc(), fcnDiffCalc2(), fcnDiffCalc3())
# Unit: milliseconds
# expr min lq mean median uq max neval
# fcnDiffCalc() 128.1442 140.1946 152.0703 154.3662 159.6809 180.5960 100
# fcnDiffCalc2() 115.4415 126.6108 138.0991 137.4108 145.2452 266.3297 100
# fcnDiffCalc3() 107.6847 116.9920 126.9131 126.0414 133.3887 227.2758 100

Fast subset/lookup/filter in large datasets

I need to repeatedly look up "closest" row in a large (many GB) table with factor and numeric columns. Using dplyr, it looks like this:
df <- data.frame(factorA = rep(letters[1:3], 100000),
factorB = sample(rep(letters[1:3], 100000),
3*100000, replace = FALSE),
numC = round(rnorm(3*100000), 2),
numD = round(rnorm(3*100000), 2))
closest <- function(ValueA, ValueB, ValueC, ValueD) {
df_sub <- df %>%
filter(factorA == ValueA,
factorB == ValueB,
numC >= 0.9 * ValueC,
numC <= 1.1 * ValueC,
numD >= 0.9 * ValueD,
numD <= 1.1 * ValueD)
if (nrow(df_sub) == 0) stop("Oh-oh, no candidates.")
minC <- df_sub[which.min(abs(df_sub$numC - ValueC)), "numC"]
df_sub %>%
filter(numC == minC) %>%
slice(which.min(abs(numD - ValueD))) %>%
as.list() %>%
return()
}
Here is a benchmark of the above:
> microbenchmark(closest("a", "b", 0.5, 0.6))
Unit: milliseconds
expr min lq mean median uq max neval
closest("a", "b", 0.5, 0.6) 25.20927 28.90623 35.16863 34.59485 35.25468 108.3489 100
What's the best way to optimize this function for speed? There's RAM to spare, even with the large df in memory, but given the many calls to this function, I would like to make it as fast as possible.
Would using a data.table instead of dplyr help?
Here are two optimizations I tried thus far:
dt <- as.data.table(df)
closest2 <- function(ValueA, ValueB, ValueC, ValueD) {
df_sub <- df %>%
filter(factorA == ValueA,
factorB == ValueB,
dplyr::between(numC, 0.9 * ValueC, 1.1 * ValueC),
dplyr::between(numD, 0.9 * ValueD, 1.1 * ValueD))
if (nrow(df_sub) == 0) stop("Oh-oh, no candidates.")
minC <- df_sub[which.min(abs(df_sub$numC - ValueC)), "numC"]
df_sub %>%
filter(numC == minC) %>%
slice(which.min(abs(numD - ValueD))) %>%
as.list() %>%
return()
}
closest3 <- function(ValueA, ValueB, ValueC, ValueD) {
dt_sub <- dt[factorA == ValueA &
factorB == ValueB &
numC %between% c(0.9 * ValueC, 1.1 * ValueC) &
numD %between% c(0.9 * ValueD, 1.1 * ValueD)]
if (nrow(dt_sub) == 0) stop("Oh-oh, no candidates.")
dt_sub[abs(numC - ValueC) == min(abs(numC - ValueC))][which.min(abs(numD - ValueD))] %>%
as.list() %>%
return()
}
The benchmark:
> microbenchmark(closest("a", "b", 0.5, 0.6), closest2("a", "b", 0.5, 0.6), closest3("a", "b", 0.5, 0.6))
Unit: milliseconds
expr min lq mean median uq max neval cld
closest("a", "b", 0.5, 0.6) 25.15780 25.62904 36.52022 34.68219 35.27116 155.31924 100 c
closest2("a", "b", 0.5, 0.6) 22.14465 22.46490 27.81361 31.40918 32.04427 35.79021 100 b
closest3("a", "b", 0.5, 0.6) 13.52094 13.77555 20.04284 22.70408 23.41452 142.73626 100 a
Can this be optimized more?

If you can call many tuples of values in parallel instead of sequentially...
set.seed(1)
DF <- data.frame(factorA = rep(letters[1:3], 100000),
factorB = sample(rep(letters[1:3], 100000),
3*100000, replace = FALSE),
numC = round(rnorm(3*100000), 2),
numD = round(rnorm(3*100000), 2))
library(data.table)
DT = data.table(DF)
f = function(vA, vB, nC, nD, dat = DT){
rs <- dat[.(vA, vB, nC), on=.(factorA, factorB, numC), roll="nearest",
.(g = .GRP, r = .I, numD), by=.EACHI][.(seq_along(vA), nD), on=.(g, numD), roll="nearest", mult="first",
r]
df[rs]
}
# example usage
mDT = data.table(vA = c("a", "b"), vB = c("c", "c"), nC = c(.3, .5), nD = c(.6, .8))
mDT[, do.call(f, .SD)]
# factorA factorB numC numD
# 1: a c 0.3 0.60
# 2: b c 0.5 0.76
Comparing with the other solutions that must be run rowwise...
# check the results match
library(magrittr)
dt = copy(DT)
mDT[, closest3(vA, vB, nC, nD), by=.(mr = seq_len(nrow(mDT)))]
# mr factorA factorB numC numD
# 1: 1 a c 0.3 0.60
# 2: 2 b c 0.5 0.76
# check speed for a larger number of comparisons
nr = 100
system.time( mDT[rep(1:2, each=nr), do.call(f, .SD)] )
# user system elapsed
# 0.07 0.00 0.06
system.time( mDT[rep(1:2, each=nr), closest3(vA, vB, nC, nD), by=.(mr = seq_len(nr*nrow(mDT)))] )
# user system elapsed
# 10.65 2.30 12.60
How it works
For each tuple in .(vA, vB, nC), we look up rows that match vA and vB exactly and then "roll" to the nearest value of nC -- this doesn't quite match the OP's rule (of looking within a bound of nC*[0.9, 1.1]), but that rule could easily be applied after-the-fact. For each match, we take the tuple's "group number," .GRP, the row numbers that were matched, and the values of numD on those rows.
Then we join on group number and nD, matching exactly on the former and rolling to nearest on the latter. If there are multiple nearest matches, we take the first with mult="first".
We can then take the row number of each tuple's match and look it up in the original table.
Performance
So the vectorized solution seems to have a big performance benefit, as usual with R.
If you can only pass ~5 tuples at a time (as for the OP) instead of 200, there will still probably be benefits from this approach vs which.min and similar, thanks to binary search, as #F.Privé suggested in a comment.
As noted in #HarlanNelson's answer, adding indices to the table might further improve performance. See his answer and ?setindex.
Fix for numC rolling to one value
Thanks to the OP for identifying this problem:
DT2 = data.table(id = "A", numC = rep(c(1.01,1.02), each=5), numD = seq(.01,.1,.01))
DT2[.("A", 1.011), on=.(id, numC), roll="nearest"]
# id numC numD
# 1: A 1.011 0.05
Here, we see one row, but we should be seeing five. One fix (though I'm not sure why) is converting to integers:
DT3 = copy(DT2)
DT3[, numC := as.integer(numC*100)]
DT3[, numD := as.integer(numD*100)]
DT3[.("A", 101.1), on=.(id, numC), roll="nearest"]
# id numC numD
# 1: A 101 1
# 2: A 101 2
# 3: A 101 3
# 4: A 101 4
# 5: A 101 5

This is cheating because I index before the benchmark, but I assume you will run the query many times on the same data.table.
library(data.table)
dt<-as.data.table(df)
setkey(dt,factorA,factorB)
closest2 <- function(ValueA, ValueB, ValueC, ValueD) {
dt<-dt[.(ValueA,ValueB), on = c('factorA','factorB')]
df_sub <- dt %>%
filter( numC >= 0.9 * ValueC,
numC <= 1.1 * ValueC,
numD >= 0.9 * ValueD,
numD <= 1.1 * ValueD)
if (nrow(df_sub) == 0) stop("Oh-oh, no candidates.")
minC <- df_sub[which.min(abs(df_sub$numC - ValueC)), "numC"]
df_sub %>%
filter(numC == minC) %>%
slice(which.min(abs(numD - ValueD))) %>%
as.list() %>%
return()
}
library(microbenchmark)
microbenchmark(closest("a", "b", 0.5, 0.6))
microbenchmark(closest2("a", "b", 0.5, 0.6))
Unit: milliseconds
expr min lq mean median uq max neval
closest("a", "b", 0.5, 0.6) 20.29775 22.55372 28.08176 23.20033 25.42154 127.7781 100
Unit: milliseconds
expr min lq mean median uq max neval
closest2("a", "b", 0.5, 0.6) 8.595854 9.063261 9.929237 9.396594 10.0247 16.92655 100

Rolling Sum by Another Variable in R

I want to get the rolling 7-day sum by ID. Suppose my data looks like this:
data<-as.data.frame(matrix(NA,42,3))
data$V1<-seq(as.Date("2014-05-01"),as.Date("2014-09-01"),by=3)
data$V2<-rep(1:6,7)
data$V3<-rep(c(1,2),21)
colnames(data)<-c("Date","USD","ID")
Date USD ID
1 2014-05-01 1 1
2 2014-05-04 2 2
3 2014-05-07 3 1
4 2014-05-10 4 2
5 2014-05-13 5 1
6 2014-05-16 6 2
7 2014-05-19 1 1
8 2014-05-22 2 2
9 2014-05-25 3 1
10 2014-05-28 4 2
How can I add a new column that will contain the rolling 7-day sum by ID?

If your data is big, you might want to check out this solution which uses data.table. It is pretty fast. If you need more speed, you can always change mapply to mcmapply and use multiple cores.
#Load data.table and convert to data.table object
require(data.table)
setDT(data)[,ID2:=.GRP,by=c("ID")]
#Build reference table
Ref <- data[,list(Compare_Value=list(I(USD)),Compare_Date=list(I(Date))), by=c("ID2")]
#Use mapply to get last seven days of value by id
data[,Roll.Val := mapply(RD = Date,NUM=ID2, function(RD, NUM) {
d <- as.numeric(Ref$Compare_Date[[NUM]] - RD)
sum((d <= 0 & d >= -7)*Ref$Compare_Value[[NUM]])})]

Dataset provided by OP does not expose the complexity of the task. In terms of addressing OP question so far only Mike's answer was the correct one.
In fact for a 8 rolling days, instead of 7 rolling days, due to d <= 0 & d >= -7.
zoo solution by #G. Grothendieck is almost valid, only if merge would be made to each group of ID.
Below second data.table solution, this time valid results, using dev RcppRoll which allows na.rm=TRUE.
And slightly formatted Mike's solution output.
data<-as.data.frame(matrix(NA,42,3))
data$V1<-seq(as.Date("2014-05-01"),as.Date("2014-09-01"),by=3)
data$V2<-rep(1:6,7)
data$V3<-rep(c(1,2),21)
colnames(data)<-c("Date","USD","ID")
library(microbenchmark)
library(RcppRoll) # install_github("kevinushey/RcppRoll")
library(data.table) # install_github("Rdatatable/data.table")
correct_jan_dt = function(n, partial=TRUE){
DT = as.data.table(data) # this can be speedup by setDT()
date.range = DT[,range(Date)]
all.dates = seq.Date(date.range[1],date.range[2],by=1)
setkey(DT,ID,Date)
r = DT[CJ(unique(ID),all.dates)][, c("roll") := as.integer(roll_sumr(USD, n, normalize = FALSE, na.rm = TRUE)), by="ID"][!is.na(USD)]
# This could be simplified when `partial` arg will be implemented in [kevinushey/RcppRoll](https://github.com/kevinushey/RcppRoll)
if(isTRUE(partial)){
r[is.na(roll), roll := cumsum(USD), by="ID"][]
}
return(r[order(Date,ID)])
}
correct_mike_dt = function(){
data = as.data.table(data)[,ID2:=.GRP,by=c("ID")]
#Build reference table
Ref <- data[,list(Compare_Value=list(I(USD)),Compare_Date=list(I(Date))), by=c("ID2")]
#Use mapply to get last seven days of value by id
data[, c("roll") := mapply(RD = Date,NUM=ID2, function(RD, NUM){
d <- as.numeric(Ref$Compare_Date[[NUM]] - RD)
sum((d <= 0 & d >= -7)*Ref$Compare_Value[[NUM]])})][,ID2:=NULL][]
}
identical(correct_mike_dt(), correct_jan_dt(n=8,partial=TRUE))
# [1] TRUE
microbenchmark(unit="relative", times=5L, correct_mike_dt(), correct_jan_dt(8))
# Unit: relative
# expr min lq mean median uq max neval
# correct_mike_dt() 274.0699 273.9892 267.2886 266.6009 266.2254 256.7296 5
# correct_jan_dt(8) 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 5
Looking forward for update from #Khashaa.
Edit (20150122.2): Below benchmarks do not answer OP question.
Timing on a bigger (still very tiny) dataset, 5439 rows:
library(zoo)
library(data.table)
library(dplyr)
library(RcppRoll)
library(microbenchmark)
data<-as.data.frame(matrix(NA,5439,3))
data$V1<-seq(as.Date("1970-01-01"),as.Date("2014-09-01"),by=3)
data$V2<-sample(1:6,5439,TRUE)
data$V3<-sample(c(1,2),5439,TRUE)
colnames(data)<-c("Date","USD","ID")
zoo_f = function(){
z <- read.zoo(data)
z0 <- merge(z, zoo(, seq(start(z), end(z), "day")), fill = 0) # expand to daily
roll <- function(x) rollsumr(x, 7, fill = NA)
transform(data, roll = ave(z0$USD, z0$ID, FUN = roll)[time(z)])
}
dt_f = function(){
DT = as.data.table(data) # this can be speedup by setDT()
date.range = DT[,range(Date)]
all.dates = seq.Date(date.range[1],date.range[2],by=1)
setkey(DT,Date)
DT[.(all.dates)
][order(Date), c("roll") := rowSums(setDT(shift(USD, 0:6, NA, "lag")),na.rm=FALSE), by="ID"
][!is.na(ID)]
}
dp_f = function(){
data %>% group_by(ID) %>%
mutate(roll=roll_sum(c(rep(NA,6), USD), 7))
}
dt2_f = function(){
# this can be speedup by setDT()
as.data.table(data)[, c("roll") := roll_sum(c(rep(NA,6), USD), 7), by="ID"][]
}
identical(as.data.table(zoo_f()),dt_f())
# [1] TRUE
identical(setDT(as.data.frame(dp_f())),dt_f())
# [1] TRUE
identical(dt2_f(),dt_f())
# [1] TRUE
microbenchmark(unit="relative", times=20L, zoo_f(), dt_f(), dp_f(), dt2_f())
# Unit: relative
# expr min lq mean median uq max neval
# zoo_f() 140.331889 141.891917 138.064126 139.381336 136.029019 137.730171 20
# dt_f() 14.917166 14.464199 15.210757 16.898931 16.543811 14.221987 20
# dp_f() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20
# dt2_f() 1.536896 1.521983 1.500392 1.518641 1.629916 1.337903 20
Yet I'm not sure if my data.table code is already optimal.
Above functions did not answer OP question. Read the top of post for update. Mike's solution was the correct one.

1) Assuming you mean every successive overlapping 7 rows for that ID:
library(zoo)
transform(data, roll = ave(USD, ID, FUN = function(x) rollsumr(x, 7, fill = NA)))
2) If you really did mean 7 days and not 7 rows then try this:
library(zoo)
z <- read.zoo(data)
z0 <- merge(z, zoo(, seq(start(z), end(z), "day")), fill = 0) # expand to daily
roll <- function(x) rollsumr(x, 7, fill = NA)
transform(data, roll = ave(z0$USD, z0$ID, FUN = roll)[time(z)])
Updated Added (2) and made some improvements.

library(data.table)
data <- data.table(Date = seq(as.Date("2014-05-01"),
as.Date("2014-09-01"),
by = 3),
USD = rep(1:6, 7),
ID = rep(c(1, 2), 21))
data[, Rolling7DaySum := {
d <- data$Date - Date
sum(data$USD[ID == data$ID & d <= 0 & d >= -7])
},
by = list(Date, ID)]

I found that there is some problem with Mike.Gahan's suggested code and correct it as below after testing it out.
require(data.table)
setDT(data)[,ID2:=.GRP,by=c("ID")]
Ref <-data[,list(Compare_Value=list(I(USD)),Compare_Date=list(I(Date))),by=c("ID2")]
data[,Roll.Val := mapply(RD = Date,NUM=ID2, function(RD, NUM) {
d <- as.numeric(Ref[ID2 == NUM,]$Compare_Date[[1]] - RD)
sum((d <= 0 & d >= -7)*Ref[ID2 == NUM,]$Compare_Value[[1]])})]

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

mutate() with an if/else function - r

Related

Sum over 7 day rolling window in R [duplicate]

Does Column of Sentences Contain Word in Another Column of Sentences?

R - Calculating differences by group for all cuts of data

Fast subset/lookup/filter in large datasets

Rolling Sum by Another Variable in R

Categories

Resources