I need to count the number of future visits by a specific customer in the next 7 days. I solved this with purrr::map2, but the performance is very slow. I think I must be missing something basic about how to use purrr. How do I speed this up? Thanks!
This toy example takes 2.3 seconds with 100 rows, but 3.3 minutes with 1,000 rows on my machine. My actual data has 400K rows!
library(tidyverse)
set.seed(123)
rows <- 1000
df <- data.frame(cust_num = sample(c("123", "124", "128"), rows, replace = TRUE),
                 date = sample(seq(as.Date('2017/01/01'), as.Date('2017/01/31'), by = "day"),
                               rows, replace = TRUE))
df <- df %>%
  rowwise() %>%
  mutate(visits.next.7.days = map2_lgl(df$cust_num, df$date,
                                       ~ .x == cust_num & .y > date & .y < (date + 7)) %>%
           sum())
Here's an option that uses purrr::reduce to sum the list of vectors returned by data.table::shift (a vectorized version of lead/lag). pmap_int with sum would do the same as reduce with + if you like, but it's a little slower. You could similarly do map(1:7, ~lead(n, .x, default = 0L)) instead of data.table::shift, but it's more code and slower.
library(tidyverse)
set.seed(123)
rows <- 1000
df = data.frame(cust_num = sample(c("123","124","128"), rows, replace = TRUE),
date = sample(seq(as.Date('2017/01/01'),
as.Date('2017/01/31'),
by = "day"),
rows, replace = TRUE))
df2 <- df %>%
count(cust_num, date) %>%
group_by(cust_num) %>%
# add dates with no occurrences; none in sample data, but quite possible in real
complete(date = seq(min(date), max(date), by = 'day'), fill = list(n = 0L)) %>%
mutate(visits_next_7 = reduce(data.table::shift(n, 1:7, type = 'lead', fill = 0L), `+`)) %>%
right_join(df)
df2
#> # A tibble: 1,000 x 4
#> # Groups: cust_num [?]
#> cust_num date n visits_next_7
#> <fctr> <date> <int> <int>
#> 1 123 2017-01-09 10 78
#> 2 128 2017-01-19 12 70
#> 3 124 2017-01-05 15 73
#> 4 128 2017-01-27 14 37
#> 5 128 2017-01-27 14 37
#> 6 123 2017-01-15 19 74
#> 7 124 2017-01-24 12 59
#> 8 128 2017-01-10 10 78
#> 9 124 2017-01-03 19 77
#> 10 124 2017-01-14 8 84
#> # ... with 990 more rows
This may not be the most efficient algorithm, as depending on the spacing of your data, complete could potentially expand your data dramatically.
Further, with data this size, you may find data.table is more practical unless you want to put your data in a database and access it with dplyr.
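For completeness, here is the lead()-based variant mentioned above, as a sketch: it swaps data.table::shift() for seven dplyr::lead() calls summed with purrr::reduce(), trading some speed for a pure-tidyverse dependency.
df %>%
  count(cust_num, date) %>%
  group_by(cust_num) %>%
  complete(date = seq(min(date), max(date), by = 'day'), fill = list(n = 0L)) %>%
  # sum the counts of the next 7 days; lead() respects the grouping
  mutate(visits_next_7 = reduce(map(1:7, ~ lead(n, .x, default = 0L)), `+`)) %>%
  right_join(df)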
A solution using the zoo package. The idea is to group the data by cust_num and date and count the rows first, then use the lead function to shift the counts by one and rollapply to sum the following six days (not including the starting date). Finally, use left_join to merge the results back into the original data frame. This should be much faster than your original approach. df3 is the final output.
library(dplyr)
library(zoo)
df2 <- df %>%
count(cust_num, date) %>%
ungroup() %>%
mutate(n2 = lead(n)) %>%
mutate(visits.next.7.days = rollapply(n2, width = 6, FUN = sum, na.rm = TRUE,
align = "left", partial = TRUE)) %>%
select(cust_num, date, visits.next.7.days)
df3 <- df %>% left_join(df2, by = c("cust_num", "date"))
head(df3)
# cust_num date visits.next.7.days
# 1 123 2017-01-09 70
# 2 128 2017-01-19 54
# 3 124 2017-01-05 58
# 4 128 2017-01-27 37
# 5 128 2017-01-27 37
# 6 123 2017-01-15 68
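A caveat: unlike the previous answer, this pipeline doesn't complete() missing dates or restart the rolling window at customer boundaries, so with gaps in the data the window can leak across dates and customers. The two answers also use slightly different windows (the next six days here versus shift(n, 1:7) above), which is why their outputs differ.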
I have a dataframe that I want to gather so that it is in tall format, and then mutate a new column whose values depend on whether the string before the underscore in another column is a member of one of the lists in a list of lists. For example, I have the following data frame and list of lists:
dummy_data <- data.frame("id" = 1:20,"test1_10" = sample(1:100, 20),"test2_11" = sample(1:100, 20),
"test3_12" = sample(1:100, 20),"check1_20" = sample(1:100, 20),
"check2_21" = sample(1:100, 20),"sound1_30" = sample(1:100, 20),
"sound2_31" = sample(1:100, 20),"sound3_32" = sample(1:100, 20))
dummylist <- list(c('test1_','test2_','test3_'),c('check1_','check2_'),c('sound1_','sound2_','sound3_'))
names(dummylist) <- c('shipments','arrivals','departures')
And then I gather the data frame like so:
dummy_data <- dummy_data %>%
gather("part", "number", 2:ncol(.))
What I want to do is add a column that has the name of the list found in dummylist where the string before the underscore in the part column is a member. And I can do that like this:
dummy_data <- dummy_data %>%
mutate(Group = case_when(
str_extract(part,'.*_') %in% dummylist[[1]] ~ names(dummylist[1]),
str_extract(part,'.*_') %in% dummylist[[2]] ~ names(dummylist[2]),
str_extract(part,'.*_') %in% dummylist[[3]] ~ names(dummylist[3])
))
However, this requires a separate str_extract line for each list/group within dummylist, and my real data has way more than 3 lists/groups. So I'm wondering: is there a more efficient way to do this mutate step to fill in the list names?
Any help is much appreciated, thanks!
It may be easier with a regex_left_join after converting 'dummylist' to a two-column dataset
library(fuzzyjoin)
library(dplyr)
library(tidyr)
library(tibble)
dummy_data %>%
# // reshape to long format - pivot_longer instead of gather
pivot_longer(cols = -id, names_to = 'part', values_to = 'number') %>%
# // join with the tibble/data.frame converted dummylist
regex_left_join(dummylist %>%
enframe(name = 'Group', value = 'part') %>%
unnest(part)) %>%
rename(part = part.x) %>%
select(-part.y)
-output
# A tibble: 160 × 4
id part number Group
<int> <chr> <int> <chr>
1 1 test1_10 72 shipments
2 1 test2_11 62 shipments
3 1 test3_12 17 shipments
4 1 check1_20 89 arrivals
5 1 check2_21 54 arrivals
6 1 sound1_30 39 departures
7 1 sound2_31 94 departures
8 1 sound3_32 95 departures
9 2 test1_10 77 shipments
10 2 test2_11 4 shipments
# … with 150 more rows
If you prepare your lookup table beforehand, you don't need any extra libraries beyond dplyr and tidyr:
lookup <- sapply(
names(dummylist),
\(nm) { setNames(rep(nm, length(dummylist[[nm]])), dummylist[[nm]]) }
) |>
setNames(nm = NULL) |>
unlist()
lookup
# test1_ test2_ test3_ check1_ check2_ sound1_ sound2_ sound3_
# "shipments" "shipments" "shipments" "arrivals" "arrivals" "departures" "departures" "departures"
Now you can just gsub on the fly, translating your parts within the usual mutate() verb:
dummy_data |>
pivot_longer(-id, names_to = 'part', values_to = 'number') |>
mutate(group = lookup[gsub('^(\\w+_).*$', '\\1', part)])
# # A tibble: 160 × 4
# id part number group
# <int> <chr> <int> <chr>
# 1 1 test1_10 91 shipments
# 2 1 test2_11 74 shipments
# 3 1 test3_12 46 shipments
# 4 1 check1_20 62 arrivals
# 5 1 check2_21 7 arrivals
# 6 1 sound1_30 35 departures
# 7 1 sound2_31 23 departures
# 8 1 sound3_32 84 departures
# 9 2 test1_10 59 shipments
# 10 2 test2_11 73 shipments
# # … with 150 more rows
I have data like this:
set.seed(2020)
df_time = data.frame(Time = as.Date(1:100), value = round(runif(100, min = 0, 100)))
head(df_time, 10)
Time value
1 1970-01-02 65
2 1970-01-03 39
3 1970-01-04 62
4 1970-01-05 48
5 1970-01-06 14
6 1970-01-07 7
7 1970-01-08 13
8 1970-01-09 39
9 1970-01-10 0
10 1970-01-11 62
And this:
df <- data.frame(from = as.Date(c(3, 6, 20)), to = as.Date(c(8, 7, 24)))
head(df)
My goal is to mutate the dataframe df such that it adds the sum of the values between the two given dates (excluding the from date and including the to date), i.e.
from to sum_value
1 1970-01-04 1970-01-09 121
2 1970-01-07 1970-01-08 13
3 1970-01-21 1970-01-25 204
At the moment I am doing it with a loop, but that takes way too long for the amount of data I have.
Does anyone know a better solution (e.g. with dplyr)?
You can use rowwise() in this case to tell dplyr to evaluate df row by row:
df %>%
rowwise() %>%
mutate(sum_value = df_time %>%
filter(Time > from, Time <= to) %>%
pull(value) %>%
sum()) %>%
ungroup() # ungroup is used to restore default behaviour of dplyr
#> # A tibble: 3 x 3
#> from to sum_value
#> <date> <date> <dbl>
#> 1 1970-01-04 1970-01-09 121
#> 2 1970-01-07 1970-01-08 13
#> 3 1970-01-21 1970-01-25 204
This is how I would do it, since I find the code easy to understand. But it's basically still a loop under the hood, as far as I can tell.
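If you are on dplyr 1.1.0 or newer, a non-equi join avoids the row-by-row evaluation entirely; a sketch, assuming dplyr >= 1.1.0:
# match every df_time row whose Time falls in (from, to], then sum per interval
df %>%
  left_join(df_time, by = join_by(from < Time, to >= Time)) %>%
  summarise(sum_value = sum(value), .by = c(from, to))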
A different and potentially faster approach could be to use data.table. I'm not as familiar with the syntax so there might be a better way, but this works:
library(data.table)
# convert to data.table
df_time <- setDT(df_time)
df <- setDT(df)
# duplicate Time column since foverlaps needs two of them
df_time <- df_time[, Time2 := Time]
# Since from day should not be included
df <- df[, from := from + 1]
setkey(df, from, to)
res <- foverlaps(df_time, df, by.x = c("Time", "Time2"), by.y = c("from", "to"), type = "within")
res <- res[, .(value = sum(value)), keyby = .(from, to)]
res[, from := from - 1]
#> from to value
#> 1: <NA> <NA> 4622
#> 2: 1970-01-04 1970-01-09 121
#> 3: 1970-01-07 1970-01-08 13
#> 4: 1970-01-21 1970-01-25 204
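The first row, with NA in from and to, collects all df_time rows that fall outside every interval; if you don't need it, you can drop it (a small addition to the answer):
res <- res[!is.na(from)]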
Consider the following example:
library(dplyr)
library(lubridate)
time <- seq(from = ymd("2014-01-01"), to = ymd("2014-02-20"), by = "days")
values <- sample(seq(from = 20, to = 50, by = 5), size = length(time), replace = TRUE)
tipe <- sample(rep(x = c("Tipe_A", "Tipe_B", "Tipe_C")), size = length(time), replace = TRUE)
df2 <- data_frame(time, tipe, values)
# A tibble: 51 x 3
time tipe values
<date> <chr> <dbl>
1 2014-01-01 Tipe_B 40
2 2014-01-02 Tipe_B 30
3 2014-01-03 Tipe_A 35
4 2014-01-04 Tipe_A 50
5 2014-01-05 Tipe_B 35
6 2014-01-06 Tipe_B 50
7 2014-01-07 Tipe_A 50
8 2014-01-08 Tipe_B 40
9 2014-01-09 Tipe_A 30
10 2014-01-10 Tipe_B 25
# ... with 41 more rows
I would like to calculate the differences between successive values and aggregate this data frame by week and tipe.
At the moment I can only do it for one type at a time:
df2 %>%
filter(tipe == "Tipe_A") %>%
mutate(diff = values - lag(values, order_by = time)) %>%
group_by(week = week(time)) %>%
summarise(avr = mean(diff, na.rm = T))
# A tibble: 7 x 2
week avr
<dbl> <dbl>
1 1 7.5
2 2 -20
3 3 3.33
4 5 0
5 6 -3.33
6 7 -10
7 8 25
However, I have a lot of types, so this would be a tedious process.
Is there a way to do this more efficiently across all types?
Here, we need to do the grouping by 'tipe' first, then calculate the 'diff', and add 'week' as a second grouping column before we take the mean in summarise:
library(dplyr)
df2 %>%
group_by(tipe) %>%
mutate(diff = values - lag(values, order_by = time)) %>%
group_by(week = week(time), .add = TRUE) %>%
summarise(avr = mean(diff, na.rm = TRUE))
Or do the arrange first
df2 %>%
arrange(tipe, time) %>%
group_by(tipe) %>%
mutate(diff = values - lag(values)) %>%
group_by(week = week(time), .add = TRUE) %>%
summarise(avr = mean(diff, na.rm = TRUE))
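Note that .add = TRUE is the argument name in dplyr >= 1.0.0; on older versions of dplyr the same option was spelled add = TRUE.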
I have a data set of hospital admission and discharge days from which I want to generate an occupied-beds count for each calendar day over a period of three years. I am using the tidyverse and lubridate packages.
My approach so far has been to convert the admit/discharge columns into an interval (the data are sensitive so I can't share actual dates):
d <- d %>%
mutate(duration = admit %--% discharge)
and then to create a tibble with one row for each day in the time range, plus a count column of zeroes that can be added to in a for loop:
t <-
tibble(
days = as.Date(date("2017-01-01"):date("2019-12-31")),
count = 0
)
Unfortunately, I can't figure out how to create a for loop that would sum days that fall within each interval. Here is my attempt thus far, which gives me uniform values of 24 throughout:
for(i in timeline$days) {
if (i %within% d$duration)
timeline$count = timeline$count + 1
}
Sample data.
library(dplyr)
set.seed(42)
d <- tibble(admit = Sys.Date() - sample(300, size = 1000, replace = TRUE)) %>%
mutate(discharge = admit + sample(0:30, size = 1000, replace = TRUE))
d
# # A tibble: 1,000 x 2
# admit discharge
# <date> <date>
# 1 2019-06-18 2019-07-14
# 2 2019-06-11 2019-06-12
# 3 2019-12-24 2020-01-18
# 4 2019-07-13 2019-07-29
# 5 2019-09-08 2019-09-23
# 6 2019-10-15 2019-10-15
# 7 2019-08-11 2019-08-28
# 8 2020-02-07 2020-02-29
# 9 2019-09-03 2019-09-10
# 10 2019-08-20 2019-09-14
# # ... with 990 more rows
We can produce a list of date ranges/sequences with Map (or purrr::pmap):
Map(seq.Date, d$admit, d$discharge, list(by = "days"))[1:2]
# [[1]]
# [1] "2019-06-18" "2019-06-19" "2019-06-20" "2019-06-21" "2019-06-22" "2019-06-23" "2019-06-24"
# [8] "2019-06-25" "2019-06-26" "2019-06-27" "2019-06-28" "2019-06-29" "2019-06-30" "2019-07-01"
# [15] "2019-07-02" "2019-07-03" "2019-07-04" "2019-07-05" "2019-07-06" "2019-07-07" "2019-07-08"
# [22] "2019-07-09" "2019-07-10" "2019-07-11" "2019-07-12" "2019-07-13" "2019-07-14"
# [[2]]
# [1] "2019-06-11" "2019-06-12"
and then combine these, tabulate them (with table), and enframe them:
Map(seq.Date, d$admit, d$discharge, list(by = "days")) %>%
do.call(c, .) %>%
table() %>%
tibble::enframe(name = "date", value = "count") %>%
# because `table` preserves a *character* representation of the Date
mutate(date = as.Date(date)) %>%
arrange(date)
# # A tibble: 328 x 2
# date count
# <date> <table>
# 1 2019-05-24 1
# 2 2019-05-25 3
# 3 2019-05-26 7
# 4 2019-05-27 8
# 5 2019-05-28 9
# 6 2019-05-29 14
# 7 2019-05-30 20
# 8 2019-05-31 20
# 9 2019-06-01 20
# 10 2019-06-02 21
# # ... with 318 more rows
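The purrr::pmap() spelling mentioned above is a drop-in replacement for Map(); a sketch producing the same list of date sequences:
library(purrr)
# calls seq.Date(admit[i], discharge[i], by = "days") for each row
pmap(list(d$admit, d$discharge), seq.Date, by = "days")[1:2]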
Here is another method using tidyverse functions.
library(tidyverse)
d %>%
mutate(days = map2(admit, discharge, seq, by = "day")) %>%
unnest(days) %>%
count(days) %>%
right_join(t, by = "days") %>%
mutate(n = coalesce(n, as.integer(count))) %>%
select(-count)
We create a sequence of dates between admit and discharge, count every unique date, and join it with t so that all the dates in t remain intact.
I have a question that I find kind of hard to explain with an MRE and hard to answer in an easy
way, mostly because I don't fully understand where the problem lies
myself. So that's my sorry-for-being-vague preamble.
I have a tibble with many sample and reference measurements, for which I want
to do some linear interpolation for each sample. I do this now by taking out
all the reference measurements, rescaling them to sample measurements using
approx, and then patching it back in. But because I take it out first, I
cannot do it nicely in a group_by dplyr pipe way. right now I do it with a
really ugly workaround where I add empty (NA) newly created columns to the
sample tibble, then do it with a for-loop.
So my question is really: how can I implement the approx part within groups
into the pipe, so that I can do everything within groups? I've experimented
with dplyr::do(), and ran into the vignette on "programming with dplyr", but
searching mostly gives me broom::augment and lm stuff that I think operates
differently... (e.g. see
Using approx() with groups in dplyr). This thread also seems promising: How do you use approx() inside of mutate_at()?
Somebody on IRC recommended using a conditional mutate with case_when, but I
don't fully understand where and how to apply that in this context yet.
I think the problem lies in the fact that I want to filter out part of the data
for the following mutate operations, but the mutate operations rely on the
grouped data that I just filtered out, if that makes any sense.
Here's an MWE:
library(tidyverse) # or just dplyr, tibble
# create fake data
data <- data.frame(
# in reality a dttm with the measurement time
timestamp = c(rep("a", 7), rep("b", 7), rep("c", 7)),
# measurement cycle, normally 40 for sample, 41 for reference
cycle = rep(c(rep(1:3, 2), 4), 3),
# whether the measurement is a reference or a sample
isref = rep(c(rep(FALSE, 3), rep(TRUE, 4)), 3),
# measurement intensity for mass 44
r44 = c(28:26, 30:26, 36, 33, 31, 38, 34, 33, 31, 18, 16, 15, 19, 18, 17)) %>%
# measurement intensity for mass 45, normally also masses up to mass 49
mutate(r45 = r44 + rnorm(21, 20))
# of course this could be tidied up to "intensity" with a new column "mass"
# (44, 45, ...), but that would make making comparisons even harder...
# overview plot
data %>%
ggplot(aes(x = cycle, y = r44, colour = isref)) +
geom_line() +
geom_line(aes(y = r45), linetype = 2) +
geom_point() +
geom_point(aes(y = r45), shape = 1) +
facet_grid(~ timestamp)
# what I would like to do
data %>%
group_by(timestamp) %>%
do(target_cycle = approx(x = data %>% filter(isref) %>% pull(r44),
y = data %>% filter(isref) %>% pull(cycle),
xout = data %>% filter(!isref) %>% pull(r44))$y) %>%
unnest()
# immediately append this new column to the original dataframe for all the
# samples (!isref) and then apply another approx for those values.
# here's my current attempt for one of the timestamps
matchref <- function(dat) {
# split the data into sample gas and reference gas
ref <- filter(dat, isref)
smp <- filter(dat, !isref)
# calculate the "target cycle", the points at which the reference intensity
# 44 matches the sample intensity 44 with linear interpolation
target_cycle <- approx(x = ref$r44,
y = ref$cycle, xout = smp$r44)
# append the target cycle to the sample gas
smp <- smp %>%
group_by(timestamp) %>%
mutate(target = target_cycle$y)
# linearly interpolate each reference gas to the target cycle
ref <- ref %>%
group_by(timestamp) %>%
# this is needed because the reference has one more cycle
mutate(target = c(target_cycle$y, NA)) %>%
# filter out all the failed ones (no interpolation possible)
filter(!is.na(target)) %>%
# calculate interpolated value based on r44 interpolation (i.e., don't
# actually interpolate this value but shift it based on the 44
# interpolation)
mutate(r44 = approx(x = cycle, y = r44, xout = target)$y,
r45 = approx(x = cycle, y = r45, xout = target)$y) %>%
select(timestamp, target, r44:r45)
# add new reference gas intensities to the correct sample gasses by the target cycle
left_join(smp, ref, by = c("timestamp", "target"))
}
matchref(data)
# and because now "target" must be length 3 (the group size) or one, not 9
# I have to create this ugly for-loop
# for which I create a copy of data that has the new columns to be created
mr <- data %>%
# filter the sample gasses (since we convert ref to sample)
filter(!isref) %>%
# add empty new columns
mutate(target = NA, r44 = NA, r45 = NA)
# apply matchref for each group timestamp
for (grp in unique(data$timestamp)) {
mr[mr$timestamp == grp, ] <- matchref(data %>% filter(timestamp == grp))
}
Here's one approach that spreads the references and samples to new columns. I drop r45 for simplicity in this example.
data %>%
select(-r45) %>%
mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
spread(isref, r44) %>%
group_by(timestamp) %>%
mutate(target_cycle = approx(x = REF, y = cycle, xout = SAMP)$y) %>%
ungroup
gives,
# timestamp cycle REF SAMP target_cycle
# <fct> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 30 28 3
# 2 a 2 29 27 4
# 3 a 3 28 26 NA
# 4 a 4 27 NA NA
# 5 b 1 31 26 NA
# 6 b 2 38 36 2.5
# 7 b 3 34 33 4
# 8 b 4 33 NA NA
# 9 c 1 15 31 NA
# 10 c 2 19 18 3
# 11 c 3 18 16 2.5
# 12 c 4 17 NA NA
Edit to address comment below
To retain r45 you can use a gather-unite-spread approach like this:
data %>%
mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
gather(r, value, r44:r45) %>%
unite(ru, r, isref, sep = "_") %>%
spread(ru, value) %>%
group_by(timestamp) %>%
mutate(target_cycle_r44 = approx(x = r44_REF, y = cycle, xout = r44_SAMP)$y) %>%
ungroup
giving,
# # A tibble: 12 x 7
# timestamp cycle r44_REF r44_SAMP r45_REF r45_SAMP target_cycle_r44
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 30 28 49.5 47.2 3
# 2 a 2 29 27 48.8 48.7 4
# 3 a 3 28 26 47.2 46.8 NA
# 4 a 4 27 NA 47.9 NA NA
# 5 b 1 31 26 51.4 45.7 NA
# 6 b 2 38 36 57.5 55.9 2.5
# 7 b 3 34 33 54.3 52.4 4
# 8 b 4 33 NA 52.0 NA NA
# 9 c 1 15 31 36.0 51.7 NA
# 10 c 2 19 18 39.1 37.9 3
# 11 c 3 18 16 39.2 35.3 2.5
# 12 c 4 17 NA 39.0 NA NA
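With tidyr >= 1.0.0 the same reshape can be written with pivot_longer()/pivot_wider(); a sketch, not part of the original answer:
data %>%
  mutate(isref = ifelse(isref, "REF", "SAMP")) %>%
  pivot_longer(r44:r45, names_to = "r") %>%
  # the default names_sep = "_" recreates the r44_REF / r44_SAMP columns
  pivot_wider(names_from = c(r, isref), values_from = value) %>%
  group_by(timestamp) %>%
  mutate(target_cycle_r44 = approx(x = r44_REF, y = cycle, xout = r44_SAMP)$y) %>%
  ungroup()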