I'm having some trouble trying to do a count of days based on starting dates. I basically just want a count of days passed since the starting date by product.
I think it is best illustrated by example.
This is what I start with:
df1 <- data.frame(Dates = seq(as.Date("2021/1/1"), as.Date("2021/1/15"), "days"),
Product = rep(c(rep("Banana", 5), rep("Apple", 5), rep("Orange", 5)))
)
Dates Product
1 2021-01-01 Banana
2 2021-01-02 Banana
3 2021-01-03 Banana
4 2021-01-04 Banana
5 2021-01-05 Banana
6 2021-01-06 Apple
7 2021-01-07 Apple
8 2021-01-08 Apple
9 2021-01-09 Apple
10 2021-01-10 Apple
11 2021-01-11 Orange
12 2021-01-12 Orange
13 2021-01-13 Orange
14 2021-01-14 Orange
15 2021-01-15 Orange
I currently have several measurements for each product that I need to plot as number of days rather than dates and I cannot make the transformation.
And this is what I want:
desired_df <- data.frame(Dates = seq(as.Date("2021/1/1"), as.Date("2021/1/15"), "days"),
Product = rep(c(rep("Banana", 5), rep("Apple", 5), rep("Orange", 5))),
Days = rep(seq(0, 4), 3)
)
Dates Product Days
1 2021-01-01 Banana 0
2 2021-01-02 Banana 1
3 2021-01-03 Banana 2
4 2021-01-04 Banana 3
5 2021-01-05 Banana 4
6 2021-01-06 Apple 0
7 2021-01-07 Apple 1
8 2021-01-08 Apple 2
9 2021-01-09 Apple 3
10 2021-01-10 Apple 4
11 2021-01-11 Orange 0
12 2021-01-12 Orange 1
13 2021-01-13 Orange 2
14 2021-01-14 Orange 3
15 2021-01-15 Orange 4
So far I've tried a few approaches, but none works.
df2 <- df1 %>%
mutate(Days = Dates - Dates[1])
df3 <- df1 %>%
group_by(Product) %>%
mutate(Days = Dates - Dates[1])
Dates Product Days
starter_dates <- df1 %>%
aggregate(by = list(df1$Product), FUN = first)
Group.1 Dates Product
1 Apple 2021-01-06 Apple
2 Banana 2021-01-01 Banana
3 Orange 2021-01-11 Orange
df4 <- df1 %>%
mutate(
Days = case_when(Product == starter_dates$Product ~ Dates - starter_dates$Dates)
)
But none produced what I want. How can I calculate the number of days from first appearance?
EDIT:
This is what I get from suggested answers:
> df1 %>% group_by(Product) %>% mutate(Days = as.numeric(Dates - Dates[1]))
# A tibble: 15 x 3
# Groups: Product [3]
Dates Product Days
<date> <chr> <dbl>
1 2021-01-01 Banana 0
2 2021-01-02 Banana 1
3 2021-01-03 Banana 2
4 2021-01-04 Banana 3
5 2021-01-05 Banana 4
6 2021-01-06 Apple 5
7 2021-01-07 Apple 6
8 2021-01-08 Apple 7
9 2021-01-09 Apple 8
10 2021-01-10 Apple 9
11 2021-01-11 Orange 10
12 2021-01-12 Orange 11
13 2021-01-13 Orange 12
14 2021-01-14 Orange 13
15 2021-01-15 Orange 14
After making sure that no other loaded package masks the dplyr functions, the code below works:
df1 %>% group_by(Product) %>%
mutate(Days=lubridate::day(Dates)-first(lubridate::day(Dates)))
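We can subtract the first "Dates" value of each product from the "Dates" value in every row. Subtracting the dates themselves, Days = as.numeric(Dates - first(Dates)), works across month boundaries; the version below uses lubridate::day(), which returns the day of the month, so it is only correct while each product's dates stay within one calendar month: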
df1 %>% group_by(Product) %>%
mutate(Days=lubridate::day(Dates)-first(lubridate::day(Dates)))
# A tibble: 15 x 3
# Groups: Product [3]
Dates Product Days
<date> <chr> <int>
1 2021-01-01 Banana 0
2 2021-01-02 Banana 1
3 2021-01-03 Banana 2
4 2021-01-04 Banana 3
5 2021-01-05 Banana 4
6 2021-01-06 Apple 0
7 2021-01-07 Apple 1
8 2021-01-08 Apple 2
9 2021-01-09 Apple 3
10 2021-01-10 Apple 4
11 2021-01-11 Orange 0
12 2021-01-12 Orange 1
13 2021-01-13 Orange 2
14 2021-01-14 Orange 3
15 2021-01-15 Orange 4
Since using the tidyverse is not a requirement, here is a base R solution:
data.frame( df1, Days=as.vector( sapply( unique(df1$Product),
function(x) df1$Dates[df1$Product==x] - df1$Dates[df1$Product==x][1] ) ) )
Dates Product Days
1 2021-01-01 Banana 0
2 2021-01-02 Banana 1
3 2021-01-03 Banana 2
4 2021-01-04 Banana 3
5 2021-01-05 Banana 4
6 2021-01-06 Apple 0
7 2021-01-07 Apple 1
8 2021-01-08 Apple 2
9 2021-01-09 Apple 3
10 2021-01-10 Apple 4
11 2021-01-11 Orange 0
12 2021-01-12 Orange 1
13 2021-01-13 Orange 2
14 2021-01-14 Orange 3
15 2021-01-15 Orange 4
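The sapply() version above relies on every product having the same number of rows and on the rows being stored product by product. As a minimal sketch that avoids both assumptions, ave() can compute the offset within each group and return it in the original row order. The column name Days follows the question; using min() as the "first appearance" is an assumption that the earliest date is the starting date.

```r
df1 <- data.frame(
  Dates = seq(as.Date("2021/1/1"), as.Date("2021/1/15"), "days"),
  Product = rep(c("Banana", "Apple", "Orange"), each = 5)
)

# Convert the dates to day counts, then subtract each product's earliest date.
# ave() applies the function per group and keeps the original row order.
df1$Days <- ave(as.numeric(df1$Dates), df1$Product,
                FUN = function(d) d - min(d))

head(df1, 7)
```

Because the subtraction is done on full dates rather than on the day of the month, this also works when a product's dates cross a month boundary.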
Related
What I want to do
I have a dataset of protest events in the United States. Some events are stand-alone events, while others persist day-after-day (a "multi-day event"). My dataset is structured at the daily level, so a three-day multi-day event is spread out over three rows.
I want to accomplish the following:
Create a cumulative sum of the number of days thus far in any given multi-day event. Specifically, I want to count the number of days between the "First day" and "Last day" of any linked event.
Put the total number of days of each multi-event as a variable
"Name" each multi-day event by concatenating the state in which the protest occurred and a sequential identity number starting at 1 in each state and extending upwards.
Data
Here's a reproducible example:
# Library
library(tidyverse) # Brings in dplyr
# Generate the data set of protests
test <- data.frame(state = c(rep("Washington", 10), rep("Idaho", 10)),
date = lubridate::ymd(c("2021-01-01", "2021-01-03", "2021-01-04", "2021-01-10", "2021-01-15",
"2021-01-16", "2021-01-17", "2021-01-18", "2021-01-19", "2021-01-28",
"2021-01-12", "2021-01-13", "2021-01-14", "2021-02-01", "2021-02-03",
"2021-02-04", "2021-02-05", "2021-02-08", "2021-02-10", "2021-02-14")))
# Now create some lags and leads
test <- test %>%
group_by(state) %>%
mutate(date_lag = lag(date),
date_lead = lead(date),
days_last = date - date_lag,
days_next = date_lead - date,
link_last = if_else(days_last <= 1, 1, 0),
link_next = if_else(days_next <= 1, 1, 0),
sequence = if_else(link_last == 0 & link_next == 1, "First day",
if_else(is.na(link_last) == TRUE & link_next == 1, "First day",
if_else(link_last == 1 & link_next == 1, "Ongoing",
if_else(link_last == 1 & link_next == 0, "Last day",
if_else(link_last == 1 & is.na(link_next)==TRUE, "Last day", "Not linked"))))))
This generates the following dataframe:
state date date_lag date_lead days_last days_next link_last link_next sequence
<chr> <date> <date> <date> <drtn> <drtn> <dbl> <dbl> <chr>
1 Washington 2021-01-01 NA 2021-01-03 NA days 2 days NA 0 NA
2 Washington 2021-01-03 2021-01-01 2021-01-04 2 days 1 days 0 1 First day
3 Washington 2021-01-04 2021-01-03 2021-01-10 1 days 6 days 1 0 Last day
4 Washington 2021-01-10 2021-01-04 2021-01-15 6 days 5 days 0 0 Not linked
5 Washington 2021-01-15 2021-01-10 2021-01-16 5 days 1 days 0 1 First day
6 Washington 2021-01-16 2021-01-15 2021-01-17 1 days 1 days 1 1 Ongoing
7 Washington 2021-01-17 2021-01-16 2021-01-18 1 days 1 days 1 1 Ongoing
8 Washington 2021-01-18 2021-01-17 2021-01-19 1 days 1 days 1 1 Ongoing
9 Washington 2021-01-19 2021-01-18 2021-01-28 1 days 9 days 1 0 Last day
10 Washington 2021-01-28 2021-01-19 NA 9 days NA days 0 NA NA
11 Idaho 2021-01-12 NA 2021-01-13 NA days 1 days NA 1 NA
12 Idaho 2021-01-13 2021-01-12 2021-01-14 1 days 1 days 1 1 Ongoing
13 Idaho 2021-01-14 2021-01-13 2021-02-01 1 days 18 days 1 0 Last day
14 Idaho 2021-02-01 2021-01-14 2021-02-03 18 days 2 days 0 0 Not linked
15 Idaho 2021-02-03 2021-02-01 2021-02-04 2 days 1 days 0 1 First day
16 Idaho 2021-02-04 2021-02-03 2021-02-05 1 days 1 days 1 1 Ongoing
17 Idaho 2021-02-05 2021-02-04 2021-02-08 1 days 3 days 1 0 Last day
18 Idaho 2021-02-08 2021-02-05 2021-02-10 3 days 2 days 0 0 Not linked
19 Idaho 2021-02-10 2021-02-08 2021-02-14 2 days 4 days 0 0 Not linked
20 Idaho 2021-02-14 2021-02-10 NA 4 days NA days 0 NA NA
What I want to create:
state date date_lag date_lead days_last days_next link_last link_next sequence cumulative duration name
<chr> <date> <date> <date> <drtn> <drtn> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 Washington 2021-01-01 NA 2021-01-03 NA days 2 days NA 0 NA NA 0 NA
2 Washington 2021-01-03 2021-01-01 2021-01-04 2 days 1 days 0 1 First day 1 2 Washington.1
3 Washington 2021-01-04 2021-01-03 2021-01-10 1 days 6 days 1 0 Last day 2 2 Washington.1
4 Washington 2021-01-10 2021-01-04 2021-01-15 6 days 5 days 0 0 Not linked NA 0 NA
5 Washington 2021-01-15 2021-01-10 2021-01-16 5 days 1 days 0 1 First day 1 5 Washington.2
6 Washington 2021-01-16 2021-01-15 2021-01-17 1 days 1 days 1 1 Ongoing 2 5 Washington.2
7 Washington 2021-01-17 2021-01-16 2021-01-18 1 days 1 days 1 1 Ongoing 3 5 Washington.2
8 Washington 2021-01-18 2021-01-17 2021-01-19 1 days 1 days 1 1 Ongoing 4 5 Washington.2
9 Washington 2021-01-19 2021-01-18 2021-01-28 1 days 9 days 1 0 Last day 5 5 Washington.2
10 Washington 2021-01-28 2021-01-19 NA 9 days NA days 0 NA NA NA NA NA
11 Idaho 2021-01-12 NA 2021-01-13 NA days 1 days NA 1 NA 1 3 Idaho.1
12 Idaho 2021-01-13 2021-01-12 2021-01-14 1 days 1 days 1 1 Ongoing 2 3 Idaho.1
13 Idaho 2021-01-14 2021-01-13 2021-02-01 1 days 18 days 1 0 Last day 3 3 Idaho.1
14 Idaho 2021-02-01 2021-01-14 2021-02-03 18 days 2 days 0 0 Not linked NA NA NA
15 Idaho 2021-02-03 2021-02-01 2021-02-04 2 days 1 days 0 1 First day 1 3 Idaho.2
16 Idaho 2021-02-04 2021-02-03 2021-02-05 1 days 1 days 1 1 Ongoing 2 3 Idaho.2
17 Idaho 2021-02-05 2021-02-04 2021-02-08 1 days 3 days 1 0 Last day 3 3 Idaho.2
18 Idaho 2021-02-08 2021-02-05 2021-02-10 3 days 2 days 0 0 Not linked NA NA NA
19 Idaho 2021-02-10 2021-02-08 2021-02-14 2 days 4 days 0 0 Not linked NA NA NA
20 Idaho 2021-02-14 2021-02-10 NA 4 days NA days 0 NA NA NA NA NA
Side question: Why is test$sequence[11] an NA and not "First day"?
I'm not sure these are the specific numbers you're looking for, but this represents what seems to me a simpler and more idiomatic tidyverse approach:
test %>%
group_by(state) %>%
mutate(days_last = as.numeric(date - lag(date)),
new_section = 1*(is.na(days_last) | days_last > 1), # 1 at the start of each state and after any gap longer than one day
section = cumsum(new_section),
name = paste(state,section, sep = ".")) %>%
group_by(name) %>%
mutate(duration = as.numeric(max(date) - min(date) + 1),
sequence = case_when(duration == 1 ~ "Unlinked",
row_number() == 1 ~ "First Day",
row_number() == n() ~ "Last Day",
TRUE ~ "Ongoing")) %>%
ungroup()
Here, I mark any gap of more than one day as a new event, take the cumulative sum, and use that to define the duration of each event.
# A tibble: 20 x 8
state date days_last new_section section name duration sequence
<chr> <date> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 Washington 2021-01-01 NA 1 1 Washington.1 1 Unlinked
2 Washington 2021-01-03 2 1 2 Washington.2 2 First Day
3 Washington 2021-01-04 1 0 2 Washington.2 2 Last Day
4 Washington 2021-01-10 6 1 3 Washington.3 1 Unlinked
5 Washington 2021-01-15 5 1 4 Washington.4 5 First Day
6 Washington 2021-01-16 1 0 4 Washington.4 5 Ongoing
7 Washington 2021-01-17 1 0 4 Washington.4 5 Ongoing
8 Washington 2021-01-18 1 0 4 Washington.4 5 Ongoing
9 Washington 2021-01-19 1 0 4 Washington.4 5 Last Day
10 Washington 2021-01-28 9 1 5 Washington.5 1 Unlinked
11 Idaho 2021-01-12 NA 1 1 Idaho.1 3 First Day
12 Idaho 2021-01-13 1 0 1 Idaho.1 3 Ongoing
13 Idaho 2021-01-14 1 0 1 Idaho.1 3 Last Day
14 Idaho 2021-02-01 18 1 2 Idaho.2 1 Unlinked
15 Idaho 2021-02-03 2 1 3 Idaho.3 3 First Day
16 Idaho 2021-02-04 1 0 3 Idaho.3 3 Ongoing
17 Idaho 2021-02-05 1 0 3 Idaho.3 3 Last Day
18 Idaho 2021-02-08 3 1 4 Idaho.4 1 Unlinked
19 Idaho 2021-02-10 2 1 5 Idaho.5 1 Unlinked
20 Idaho 2021-02-14 4 1 6 Idaho.6 1 Unlinked
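The "flag a gap, then take the cumulative sum" idea can be seen in isolation with a few dates. The following is only a sketch in base R, using some of the Idaho dates from the question; the variable names gap, new_event, event_id and day_in_event are made up for illustration.

```r
dates <- as.Date(c("2021-01-12", "2021-01-13", "2021-01-14",
                   "2021-02-01", "2021-02-03", "2021-02-04", "2021-02-05"))

# Gap in days to the previous row; the first row has no predecessor (NA).
gap <- c(NA, as.numeric(diff(dates)))

# TRUE wherever a new event starts: the first row, or a gap above one day.
new_event <- is.na(gap) | gap > 1

# Counting the TRUEs seen so far gives every row its event number.
event_id <- cumsum(new_event)

# Position of each row within its event (the running day count).
day_in_event <- ave(seq_along(dates), event_id, FUN = seq_along)

data.frame(dates, gap, new_event, event_id, day_in_event)
```

Every row that starts a new event bumps the running total by one, so all rows of the same event share one event_id, which is what makes the later group_by(name) step possible.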
I think creating specific functions to do the counting is easier than trying to do everything in a single pipe.
I left all the intermediate steps and the intermediate columns in the output so you can see what each step is doing. It's very likely you won't need to keep all these columns and you probably can simplify the steps once you understand the approach.
library(tidyverse) # Brings in dplyr
# Generate the data set of protests
test <- data.frame(state = c(rep("Washington", 10), rep("Idaho", 10)),
date = lubridate::ymd(c("2021-01-01", "2021-01-03", "2021-01-04", "2021-01-10", "2021-01-15",
"2021-01-16", "2021-01-17", "2021-01-18", "2021-01-19", "2021-01-28",
"2021-01-12", "2021-01-13", "2021-01-14", "2021-02-01", "2021-02-03",
"2021-02-04", "2021-02-05", "2021-02-08", "2021-02-10", "2021-02-14")))
event_count <- function(v){
cnt <- 0
result <- integer(length(v))
for(idx in seq_along(v)) {
if(v[idx]) {
cnt <- 0
} else {
cnt <- cnt + 1
}
result[idx] <- cnt
}
result
}
need_name <- function(cnt) {
result <- logical(length(cnt))
for(idx in seq_along(cnt)){
if(cnt[idx] == 0){
if(idx == length(cnt)){
result[idx] <- FALSE
break
}
result[idx] <- (cnt[idx + 1] != 0)
} else{
result[idx] <- TRUE
}
}
result
}
running_count <- function(v) {
cnt <- 0
flag <- FALSE
result <- integer(length(v))
for(idx in seq_along(v)){
if(v[idx]) {
if(!flag) {
cnt <- cnt + 1
flag <- !flag
}
result[idx] <- cnt
} else{
result[idx] <- 0
flag <- FALSE
}
}
result
}
test %>%
group_by(state) %>%
arrange(date, .by_group = TRUE) %>%
mutate(
duration = date - lag(date), # --- Compute durations
is_first = duration != 1 # --- Check if it is the first day of a protest
) %>%
replace_na(list(is_first = TRUE)) %>% # --- No more NAs
ungroup %>%
mutate(
cnt = event_count(is_first), # --- How many days does this event have?
need_name = need_name(cnt) # --- Should we name this event?
) %>%
group_by(state) %>%
mutate(
name_number = running_count(need_name) # --- What's the event count?
) %>%
mutate(
name = paste0(state, ".", name_number), # ---- Create names
cumulative = cnt + 1 # --- Start counting from one instead of zero
) %>%
group_by(name) %>%
mutate(
duration = max(cumulative) # --- Total duration: the highest day count within the event
) %>%
ungroup() %>%
mutate( # --- Adding the NAs back
name = if_else(name_number == 0, NA_character_, name),
duration = if_else(name_number == 0, NA_integer_, as.integer(duration)),
cumulative = if_else(name_number == 0, NA_integer_, as.integer(cumulative)),
)
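data.table::rleid is useful here to create run lengths based on the condition days_last == 1 or days_next == 1 (i.e., sequential dates). If you want different event lengths, you can edit that condition. Note that consecutive unlinked dates also form a single run, so they end up sharing a name (see rows 18 and 19 of the output).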
library(dplyr)
library(data.table)
test %>%
dplyr::group_by(state) %>%
dplyr::mutate(days_last = c(NA, diff(date)),
days_next = as.numeric(lead(date) - date),
name = paste0(state, ".", data.table::rleid(days_last == 1 | days_next == 1))) %>%
dplyr::group_by(name) %>%
dplyr::mutate(sequence = case_when(
n() == 1 ~ "Not Linked",
row_number() == 1 ~ "First day",
n() == row_number() ~ "Last day",
TRUE ~ "Ongoing"),
duration = n(),
cumulative = seq_along(name)) %>%
dplyr::ungroup()
Output
state date days_last days_next name sequence duration cumulative
<chr> <date> <dbl> <dbl> <chr> <chr> <int> <int>
1 Washington 2021-01-01 NA 2 Washington.1 Not Linked 1 1
2 Washington 2021-01-03 2 1 Washington.2 First day 2 1
3 Washington 2021-01-04 1 6 Washington.2 Last day 2 2
4 Washington 2021-01-10 6 5 Washington.3 Not Linked 1 1
5 Washington 2021-01-15 5 1 Washington.4 First day 5 1
6 Washington 2021-01-16 1 1 Washington.4 Ongoing 5 2
7 Washington 2021-01-17 1 1 Washington.4 Ongoing 5 3
8 Washington 2021-01-18 1 1 Washington.4 Ongoing 5 4
9 Washington 2021-01-19 1 9 Washington.4 Last day 5 5
10 Washington 2021-01-28 9 NA Washington.5 Not Linked 1 1
11 Idaho 2021-01-12 NA 1 Idaho.1 First day 3 1
12 Idaho 2021-01-13 1 1 Idaho.1 Ongoing 3 2
13 Idaho 2021-01-14 1 18 Idaho.1 Last day 3 3
14 Idaho 2021-02-01 18 2 Idaho.2 Not Linked 1 1
15 Idaho 2021-02-03 2 1 Idaho.3 First day 3 1
16 Idaho 2021-02-04 1 1 Idaho.3 Ongoing 3 2
17 Idaho 2021-02-05 1 3 Idaho.3 Last day 3 3
18 Idaho 2021-02-08 3 2 Idaho.4 First day 2 1
19 Idaho 2021-02-10 2 4 Idaho.4 Last day 2 2
20 Idaho 2021-02-14 4 NA Idaho.5 Not Linked 1 1
If need be, you can use the NA in the days_last column to set the values in other rows to NA.
Side question: Why is test$sequence[11] an NA and not "First day"?
Generally, NA propagates in R: if NA is part of an evaluation, the result is normally NA. When you define sequence, your first if_else condition is link_last == 0 & link_next == 1. On row 11, link_last is NA and link_next is 1, so what you are evaluating is:
NA == 0 & 1 == 1
[1] NA
Instead, the nested is.na() condition should come first. As your if_else is currently written, that nested condition never gets a chance to decide the result:
is.na(NA) & 1 == 1
[1] TRUE
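To see the effect of the ordering, here is a small sketch using base R's ifelse() on the values from row 11 (link_last is NA, link_next is 1). The labels come from the question; "Other" is just a placeholder for the remaining branches.

```r
link_last <- NA
link_next <- 1

# Original order: the first test evaluates to NA, so the result is NA
# and the is.na() branch never decides anything.
original <- ifelse(link_last == 0 & link_next == 1, "First day",
            ifelse(is.na(link_last) & link_next == 1, "First day", "Other"))

# Reordered: the is.na() test comes first and evaluates to TRUE.
reordered <- ifelse(is.na(link_last) & link_next == 1, "First day",
             ifelse(link_last == 0 & link_next == 1, "First day", "Other"))

original
reordered
```

The same reasoning applies to dplyr::if_else() and to case_when(): put the is.na() checks before any comparison that might see an NA.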
Here is a data.table approach.
library(data.table)
# Convert from data.frame to data.table
setDT(test)
# Subset the variables.
test2 <- test[, .(state, date, days_last = as.numeric(days_last),
days_next = as.numeric(days_next), sequence)]
# Code
test2[, name := paste0(state, '.', rleid(days_last == 1 | days_next == 1)),
by = state][
, ':='(duration = .N,
cumulative = seq_len(.N)),
by = name
][, c('days_next', 'days_last'):=NULL] # Removing these variables. Feel free to add back!
# Reorder the variables
test2 <- setcolorder(test2, c('state', 'name', 'date',
'sequence', 'duration',
'cumulative'))
# Print first 15 rows
print(test2[1:15,])
#> state name date sequence duration cumulative
#> 1: Washington Washington.1 2021-01-01 <NA> 1 1
#> 2: Washington Washington.2 2021-01-03 First day 2 1
#> 3: Washington Washington.2 2021-01-04 Last day 2 2
#> 4: Washington Washington.3 2021-01-10 Not linked 1 1
#> 5: Washington Washington.4 2021-01-15 First day 5 1
#> 6: Washington Washington.4 2021-01-16 Ongoing 5 2
#> 7: Washington Washington.4 2021-01-17 Ongoing 5 3
#> 8: Washington Washington.4 2021-01-18 Ongoing 5 4
#> 9: Washington Washington.4 2021-01-19 Last day 5 5
#> 10: Washington Washington.5 2021-01-28 <NA> 1 1
#> 11: Idaho Idaho.1 2021-01-12 <NA> 3 1
#> 12: Idaho Idaho.1 2021-01-13 Ongoing 3 2
#> 13: Idaho Idaho.1 2021-01-14 Last day 3 3
#> 14: Idaho Idaho.2 2021-02-01 Not linked 1 1
#> 15: Idaho Idaho.3 2021-02-03 First day 3 1
Created on 2021-03-16 by the reprex package (v0.3.0)
I am new to R. I would like to set 2010-02-01 as my benchmark date and compare every value against the row with that date.
Here is my data frame; I want to be able to generate the DIFF column with R:
DATE       FRUIT  LOCATION VALUE DIFF
2010-01-01 Apple  USA          2   -2
2010-02-01 Apple  USA          4    0
2020-11-01 Apple  USA        100   96
2020-12-01 Apple  USA         54   50
2010-01-01 Apple  China        0   -4
2010-02-01 Apple  China        4    0
2020-11-01 Apple  China       40   36
2020-12-01 Apple  China       44   40
2010-01-01 Banana USA          1   -1
2010-02-01 Banana USA          2    0
2020-11-01 Banana USA         12   10
2020-12-01 Banana USA         13   11
2010-01-01 Banana China        0 -100
2010-02-01 Banana China      100    0
2020-11-01 Banana China      130   30
2020-12-01 Banana China      145   45
Thank you!
Using dplyr, you can do:
library(dplyr)
compare_date <- as.Date('2010-02-01')
# Column names follow the table in the question (DATE, FRUIT, LOCATION, VALUE)
df %>%
  mutate(DATE = as.Date(DATE)) %>%
  group_by(FRUIT, LOCATION) %>%
  mutate(DIFF = VALUE - VALUE[match(compare_date, DATE)]) -> result
result
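For readers who prefer to avoid extra packages, the same "subtract the benchmark row of each group" idea can be sketched in base R. The data below is a subset of the Apple rows from the question; the helper column BENCH is a made-up name used only for illustration.

```r
df <- data.frame(
  DATE = as.Date(rep(c("2010-01-01", "2010-02-01", "2020-11-01"), 2)),
  FRUIT = "Apple",
  LOCATION = rep(c("USA", "China"), each = 3),
  VALUE = c(2, 4, 100, 0, 4, 40)
)
compare_date <- as.Date("2010-02-01")

# One benchmark value per FRUIT/LOCATION pair, taken from the comparison date.
bench <- df[df$DATE == compare_date, c("FRUIT", "LOCATION", "VALUE")]
names(bench)[3] <- "BENCH"

# Attach the benchmark to every row of the same group, then subtract.
out <- merge(df, bench, by = c("FRUIT", "LOCATION"))
out$DIFF <- out$VALUE - out$BENCH
out[order(out$LOCATION, out$DATE), ]
```

A group without a row on the comparison date drops out of the merge() result; pass all.x = TRUE if such groups should be kept with an NA difference.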
I have two data frames that I'd like to join by date:
library(dplyr)     # as_tibble(), %>%
library(lubridate) # ymd()
df1 <-
data.frame(
day = seq(ymd("2020-01-01"), ymd("2020-01-14"), by = "1 day"),
key = rep(c("green", "blue"), 7),
value_x = sample(1:100, 14)
) %>%
as_tibble()
df2 <-
data.frame(
day = seq(ymd("2020-01-01"), ymd("2020-01-12"), by = "3 days"),
key = rep(c("green", "blue"), 2),
value_y = c(2, 4, 6, 8)
) %>%
as_tibble()
I want the output to be like this
# A tibble: 14 x 3
day key value_x value_y
<date> <fct> <int> <int>
1 2020-01-01 green 91 2
2 2020-01-02 blue 28 NA
3 2020-01-03 green 75 2
4 2020-01-04 blue 14 4
5 2020-01-05 green 3 2
6 2020-01-06 blue 27 4
7 2020-01-07 green 15 6
8 2020-01-08 blue 7 4
9 2020-01-09 green 1 6
10 2020-01-10 blue 10 8
11 2020-01-11 green 9 6
12 2020-01-12 blue 76 8
13 2020-01-13 green 31 6
14 2020-01-14 blue 62 8
I tried doing this code
merge(df1, df2, by = c("day", "key"), all.x = TRUE)
I'd like the day in the left table to join to the most recent day in the Y table that has a value. If there is no value, then it should be NA.
Edit --
Not all the dates in df2 will appear in df1, although they do share a common ID. Here is an example:
df1
day id key
1 2020-01-08 A green
2 2020-01-10 A green
3 2020-02-24 A blue
4 2020-03-24 A green
df2
day id value
1 2020-01-03 A 2
2 2020-01-07 A 4
3 2020-01-22 A 4
4 2020-03-24 A 6
desired output
day id key value
1 2020-01-08 A green 4
2 2020-01-10 A green 4
3 2020-02-24 A blue 4
4 2020-03-24 A green 6
After merging, you can arrange the data based on key and day and fill with the most recent non-NA value.
library(dplyr)
merge(df1, df2, by = c('day', 'key'), all.x = TRUE) %>%
arrange(key, day) %>%
group_by(key) %>%
tidyr::fill(value_y) %>%
arrange(day)
# day key value_x value_y
#1 2020-01-01 green 40 2
#2 2020-01-02 blue 45 NA
#3 2020-01-03 green 54 2
#4 2020-01-04 blue 11 4
#5 2020-01-05 green 12 2
#6 2020-01-06 blue 7 4
#7 2020-01-07 green 72 6
#8 2020-01-08 blue 76 4
#9 2020-01-09 green 52 6
#10 2020-01-10 blue 32 8
#11 2020-01-11 green 69 6
#12 2020-01-12 blue 10 8
#13 2020-01-13 green 63 6
#14 2020-01-14 blue 84 8
For the updated data, you can use the following:
df1 %>%
left_join(df2, by = 'id') %>%
mutate(diff = day.x - day.y) %>%
group_by(id, key, day.x) %>%
filter(diff == min(diff[diff >= 0])) %>%
arrange(day.x) %>%
select(day = day.x, id, key, value)
# day id key value
# <date> <chr> <chr> <int>
#1 2020-01-08 A green 4
#2 2020-01-10 A green 4
#3 2020-02-24 A blue 4
#4 2020-03-24 A green 6
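The "most recent earlier date" lookup can also be sketched without building the full cross join, using base R's findInterval(). This is only an illustration on the updated example data and assumes a single id; with several ids the same lookup would have to be applied per id.

```r
df1 <- data.frame(
  day = as.Date(c("2020-01-08", "2020-01-10", "2020-02-24", "2020-03-24")),
  id = "A",
  key = c("green", "green", "blue", "green")
)
df2 <- data.frame(
  day = as.Date(c("2020-01-03", "2020-01-07", "2020-01-22", "2020-03-24")),
  id = "A",
  value = c(2, 4, 4, 6)
)

# findInterval() needs the lookup dates in increasing order.
df2 <- df2[order(df2$day), ]

# For each day in df1: index of the latest df2 day that is <= that day
# (0 when the df1 day precedes every df2 day).
pos <- findInterval(as.numeric(df1$day), as.numeric(df2$day))

# Index 0 means "no earlier value", which should become NA.
pos[pos == 0] <- NA
df1$value <- df2$value[pos]
df1
```

This keeps exactly one row per row of df1, so no filtering step is needed afterwards.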
A dataset describes multiple repeating measurements for multiple clusters, with each measurement-cluster pair contained in a single column. I would like to wrangle the data into a long(er) format, such that one column provides information on the cluster, but each measurement remains in its own column.
# Current format
df_wider <- data.frame(
id = 1:5,
fruit_1 = sample(fruit, size = 5),
date_1 = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 5),
number_1 = sample(1:100, 5),
fruit_2 = sample(fruit, size = 5),
date_2 = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 5),
number_2 = sample(1:100, 5),
fruit_3 = sample(fruit, size = 5),
date_3 = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 5),
number_3 = sample(1:100, 5)
)
# Desired format
df_longer <- data.frame(
id = rep(1:5, each = 3),
cluster = rep(1:3, 5),
fruit = sample(fruit, size = 15),
date = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 15),
number = sample(1:100, 15)
)
The real dataset contains up to 25 clusters of 100s of measurements each. I attempted to use tidyr::gather() and tidyr::pivot_longer() iterated over each measurement, but the resulting intermediate dataframes increased exponentially in size. Attempting to do so in a single tidyr::pivot_longer() step is impossible due to the values' being of different class. I am unable to think of a way to vectorize this up to scale.
You could do:
library(tidyr)
library(dplyr)
df_wider %>% pivot_longer(-id,
names_pattern = "(.*)_(\\d+)", # \\d+ so that clusters 10 and above are matched in full
names_to = c(".value", "cluster"))
# A tibble: 15 x 5
id cluster fruit date number
<int> <chr> <fct> <date> <int>
1 1 1 olive 2020-04-21 50
2 1 2 elderberry 2020-02-23 59
3 1 3 cherimoya 2020-03-07 9
4 2 1 jujube 2020-03-22 88
5 2 2 mandarine 2020-03-06 45
6 2 3 grape 2020-04-23 78
7 3 1 nut 2020-01-26 53
8 3 2 cantaloupe 2020-01-27 70
9 3 3 durian 2020-02-15 39
10 4 1 chili pepper 2020-03-17 60
11 4 2 raisin 2020-04-14 20
12 4 3 cloudberry 2020-03-11 4
13 5 1 honeydew 2020-01-04 81
14 5 2 lime 2020-03-23 53
15 5 3 ugli fruit 2020-01-13 26
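To make clear what the name pattern does, here is a small base R sketch that splits some column names the same way: the first capture group is the measurement (which ".value" turns into a column), the second is the cluster. The two-digit names and the anchored pattern with \\d+ are assumptions added here to cover the "up to 25 clusters" case from the question.

```r
nms <- c("fruit_1", "date_1", "number_1", "fruit_12", "date_12", "number_12")
pattern <- "^(.*)_(\\d+)$"

# First capture group: the measurement name.
measurement <- sub(pattern, "\\1", nms)

# Second capture group: the cluster number.
cluster <- as.integer(sub(pattern, "\\2", nms))

data.frame(name = nms, measurement, cluster)
```

Because each measurement becomes its own column, fruit, date and number can keep their different classes, which is why a single pivot_longer() call works here.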
We can use melt from data.table
library(data.table)
melt(setDT(df_wider), measure = patterns('^fruit', '^date', '^number' ),
value.name = c('fruit', 'date', 'number'), variable.name = 'cluster')
# id cluster fruit date number
# 1: 1 1 date 2020-04-16 17
# 2: 2 1 quince 2020-01-27 7
# 3: 3 1 coconut 2020-04-19 33
# 4: 4 1 pomegranate 2020-02-27 55
# 5: 5 1 persimmon 2020-02-20 62
# 6: 1 2 kiwi fruit 2020-01-14 100
# 7: 2 2 cranberry 2020-03-15 97
# 8: 3 2 cucumber 2020-03-16 5
# 9: 4 2 persimmon 2020-03-06 81
#10: 5 2 date 2020-04-17 30
#11: 1 3 apricot 2020-04-13 86
#12: 2 3 banana 2020-04-17 42
#13: 3 3 bilberry 2020-02-23 88
#14: 4 3 blackcurrant 2020-02-25 10
#15: 5 3 raisin 2020-02-09 87
Here's some dummy data:
user_id date category
27 2016-01-01 apple
27 2016-01-03 apple
27 2016-01-05 pear
27 2016-01-07 plum
27 2016-01-10 apple
27 2016-01-14 pear
27 2016-01-16 plum
11 2016-01-01 apple
11 2016-01-03 pear
11 2016-01-05 pear
11 2016-01-07 pear
11 2016-01-10 apple
11 2016-01-14 apple
11 2016-01-16 apple
I'd like to calculate, for each user_id, the number of distinct categories in a specified time period (e.g. the past 7 or 14 days), including the current order.
The solution would look like this:
user_id date category distinct_7 distinct_14
27 2016-01-01 apple 1 1
27 2016-01-03 apple 1 1
27 2016-01-05 pear 2 2
27 2016-01-07 plum 3 3
27 2016-01-10 apple 3 3
27 2016-01-14 pear 3 3
27 2016-01-16 plum 3 3
11 2016-01-01 apple 1 1
11 2016-01-03 pear 2 2
11 2016-01-05 pear 2 2
11 2016-01-07 pear 2 2
11 2016-01-10 apple 2 2
11 2016-01-14 apple 2 2
11 2016-01-16 apple 1 2
I posted similar questions here and here; however, neither of them dealt with counting cumulative unique values over a specified time period. Thanks a lot for your help!
I recommend using the runner package. Its runner() function lets you apply any R function over running windows. The code below produces the desired output, that is, the past 7 days plus the current day and the past 14 days plus the current day (windows of 8 and 15 days):
df <- read.table(
text = " user_id date category
27 2016-01-01 apple
27 2016-01-03 apple
27 2016-01-05 pear
27 2016-01-07 plum
27 2016-01-10 apple
27 2016-01-14 pear
27 2016-01-16 plum
11 2016-01-01 apple
11 2016-01-03 pear
11 2016-01-05 pear
11 2016-01-07 pear
11 2016-01-10 apple
11 2016-01-14 apple
11 2016-01-16 apple", header = TRUE, colClasses = c("integer", "Date", "character"))
library(dplyr)
library(runner)
df %>%
group_by(user_id) %>%
mutate(distinct_7 = runner(category, k = 7 + 1, idx = date,
f = function(x) length(unique(x))),
distinct_14 = runner(category, k = 14 + 1, idx = date,
f = function(x) length(unique(x))))
More information is available in the package and function documentation.
Here are two data.table solutions, one with two nested lapply calls and the other using non-equi joins.
The first one is a rather clumsy data.table solution, but it reproduces the expected answer and works for an arbitrary number of time frames (although the concise tidyverse solution that @alistaire suggested in his comment could be adapted as well).
It uses two nested lapply calls. The first one loops over the time frames, the second one over the dates. The temporary result is joined with the original data and then reshaped from long to wide format, so that we end up with a separate column for each time frame.
library(data.table)
tmp <- rbindlist(
lapply(c(7L, 14L),
function(ldays) rbindlist(
lapply(unique(dt$date),
function(ldate) {
dt[between(date, ldate - ldays, ldate),
.(distinct = sprintf("distinct_%02i", ldays),
date = ldate,
N = uniqueN(category)),
by = .(user_id)]
})
)
)
)
dcast(tmp[dt, on=c("user_id", "date")],
... ~ distinct, value.var = "N")[order(-user_id, date, category)]
# date user_id category distinct_07 distinct_14
# 1: 2016-01-01 27 apple 1 1
# 2: 2016-01-03 27 apple 1 1
# 3: 2016-01-05 27 pear 2 2
# 4: 2016-01-07 27 plum 3 3
# 5: 2016-01-10 27 apple 3 3
# 6: 2016-01-14 27 pear 3 3
# 7: 2016-01-16 27 plum 3 3
# 8: 2016-01-01 11 apple 1 1
# 9: 2016-01-03 11 pear 2 2
#10: 2016-01-05 11 pear 2 2
#11: 2016-01-07 11 pear 2 2
#12: 2016-01-10 11 apple 2 2
#13: 2016-01-14 11 apple 2 2
#14: 2016-01-16 11 apple 1 2
Here is a variant following a suggestion by @Frank, which uses data.table's non-equi joins instead of the second lapply:
tmp <- rbindlist(
lapply(c(7L, 14L),
function(ldays) {
dt[.(user_id = user_id, dago = date - ldays, d = date),
on=.(user_id, date >= dago, date <= d),
.(distinct = sprintf("distinct_%02i", ldays),
N = uniqueN(category)),
by = .EACHI]
}
)
)[, date := NULL]
#
dcast(tmp[dt, on=c("user_id", "date")],
... ~ distinct, value.var = "N")[order(-user_id, date, category)]
Data:
dt <- fread("user_id date category
27 2016-01-01 apple
27 2016-01-03 apple
27 2016-01-05 pear
27 2016-01-07 plum
27 2016-01-10 apple
27 2016-01-14 pear
27 2016-01-16 plum
11 2016-01-01 apple
11 2016-01-03 pear
11 2016-01-05 pear
11 2016-01-07 pear
11 2016-01-10 apple
11 2016-01-14 apple
11 2016-01-16 apple")
dt[, date := as.IDate(date)]
BTW: The wording in the past 7, 14 days is somewhat misleading as the time periods actually consist of 8 and 15 days, resp.
In the tidyverse, you can use map_int to iterate over a set of values and simplify to an integer à la sapply or vapply. Count distinct occurrences with n_distinct (like length(unique(...))) of an object subset by comparisons or the helper between, with a minimum set by the appropriate amount subtracted from that day, and you're set.
library(tidyverse)
df %>% group_by(user_id) %>%
mutate(distinct_7 = map_int(date, ~n_distinct(category[between(date, .x - 7, .x)])),
distinct_14 = map_int(date, ~n_distinct(category[between(date, .x - 14, .x)])))
## Source: local data frame [14 x 5]
## Groups: user_id [2]
##
## user_id date category distinct_7 distinct_14
## <int> <date> <fctr> <int> <int>
## 1 27 2016-01-01 apple 1 1
## 2 27 2016-01-03 apple 1 1
## 3 27 2016-01-05 pear 2 2
## 4 27 2016-01-07 plum 3 3
## 5 27 2016-01-10 apple 3 3
## 6 27 2016-01-14 pear 3 3
## 7 27 2016-01-16 plum 3 3
## 8 11 2016-01-01 apple 1 1
## 9 11 2016-01-03 pear 2 2
## 10 11 2016-01-05 pear 2 2
## 11 11 2016-01-07 pear 2 2
## 12 11 2016-01-10 apple 2 2
## 13 11 2016-01-14 apple 2 2
## 14 11 2016-01-16 apple 1 2
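The window logic shared by all of these answers can be written out in plain base R, which may help when checking results. This is only a sketch: distinct_within() is a hypothetical helper name, and the window is taken as [date - days, date], i.e. the current day plus the given number of past days.

```r
dates <- as.Date(c("2016-01-01", "2016-01-03", "2016-01-05", "2016-01-07",
                   "2016-01-10", "2016-01-14", "2016-01-16"))
df <- data.frame(
  user_id = rep(c(27L, 11L), each = 7),
  date = rep(dates, 2),
  category = c("apple", "apple", "pear", "plum", "apple", "pear", "plum",
               "apple", "pear", "pear", "pear", "apple", "apple", "apple")
)

# Hypothetical helper: for each row, count the distinct categories whose
# date falls inside [date - days, date].
distinct_within <- function(date, category, days) {
  vapply(seq_along(date), function(i) {
    in_window <- date >= date[i] - days & date <= date[i]
    length(unique(category[in_window]))
  }, integer(1))
}

# Apply the helper per user so that windows never mix users.
df$distinct_7 <- NA_integer_
for (r in split(seq_len(nrow(df)), df$user_id)) {
  df$distinct_7[r] <- distinct_within(df$date[r], df$category[r], 7)
}
df
```

The loop compares every row with every other row of the same user, so it is fine for small groups but will be slower than the runner or non-equi join approaches on large data.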