I'm trying to mark all dates that fall within any of several ranges stored in a different table.
The events table among other variables contains start_date and end_date of events:
events <- tibble(
name = c("Event A", "Event B"),
start_date = as.Date(c("2021-10-17", "2021-02-19")),
end_date = as.Date(c("2021-10-19", "2021-02-10"))
)
The date_info table contains date, statistic and value information in the long format for all days of the year:
date_info <- tibble(
date = as.Date(c("2021-10-16", "2021-10-16", "2021-10-17", "2021-10-17")),
statistic = c("var1", "var2", "var1", "var2"),
value = c(10, 54, 23, 34)
)
I need to make a new column in date_info to mark dates which fall within any date range of events.
I've tried the approach below, but it works only if there is a single event in events:
library(tidyverse)
date_info %>%
mutate(in_range = if_else(date < events$start_date | date > events$end_date, FALSE, TRUE))
I thought about creating a date_range vector in events such that code below can be used to mark the dates:
library(tidyverse)
date_info %>%
mutate(in_range = if_else(date %in% events$date_range, TRUE, FALSE))
However, I'm not sure that this is the best approach. Additionally, I'm not sure how to build such a date range, as seq() works on a single start/end date pair rather than on a vector of pairs.
This can be done as a range-based (non-equi) join. When this answer was written, dplyr alone could not do it (dplyr 1.1.0 and later support non-equi joins through join_by()), but any of the following should work fine.
The code below assigns the particular events$name to each row, not just an "in range" indicator. It's not hard to simplify that with in_range = !is.na(name) or similar.
fuzzyjoin
# library(fuzzyjoin)
date_info %>%
fuzzyjoin::fuzzy_left_join(events,
by = c(date = "start_date", date = "end_date"),
match_fun = list(`>=`, `<=`))
# # A tibble: 4 x 6
# date statistic value name start_date end_date
# <date> <chr> <dbl> <chr> <date> <date>
# 1 2021-10-16 var1 10 NA NA NA
# 2 2021-10-16 var2 54 NA NA NA
# 3 2021-10-17 var1 23 Event A 2021-10-17 2021-10-19
# 4 2021-10-17 var2 34 Event A 2021-10-17 2021-10-19
sqldf
# library(sqldf)
sqldf::sqldf("
select t1.*, t2.name
from date_info t1
left join events t2 on t1.date between t2.start_date and t2.end_date")
# date statistic value name
# 1 2021-10-16 var1 10 <NA>
# 2 2021-10-16 var2 54 <NA>
# 3 2021-10-17 var1 23 Event A
# 4 2021-10-17 var2 34 Event A
data.table
library(data.table)
date_info_DT <- as.data.table(date_info)
events_DT <- as.data.table(events)
date_info_DT[events_DT, name := i.name,
on = .(date >= start_date, date <= end_date)][]
# date statistic value name
# <Date> <char> <num> <char>
# 1: 2021-10-16 var1 10 <NA>
# 2: 2021-10-16 var2 54 <NA>
# 3: 2021-10-17 var1 23 Event A
# 4: 2021-10-17 var2 34 Event A
(There's also data.table::foverlaps, which requires the second data.table to be keyed.)
Another option, a bit simpler (not requiring class-changes):
date_info %>%
mutate(in_range = data.table::inrange(date, events$start_date, events$end_date))
# # A tibble: 4 x 4
# date statistic value in_range
# <date> <chr> <dbl> <lgl>
# 1 2021-10-16 var1 10 FALSE
# 2 2021-10-16 var2 54 FALSE
# 3 2021-10-17 var1 23 TRUE
# 4 2021-10-17 var2 34 TRUE
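For readers on dplyr 1.1.0 or later, here is a minimal sketch of the same range join using join_by() with its between() helper. It assumes that dplyr version is installed; the data is copied from the question as posted, and the name res is only illustrative.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for join_by()

# Data copied from the question
events <- tibble(
  name = c("Event A", "Event B"),
  start_date = as.Date(c("2021-10-17", "2021-02-19")),
  end_date = as.Date(c("2021-10-19", "2021-02-10"))
)
date_info <- tibble(
  date = as.Date(c("2021-10-16", "2021-10-16", "2021-10-17", "2021-10-17")),
  statistic = c("var1", "var2", "var1", "var2"),
  value = c(10, 54, 23, 34)
)

# Keep every row of date_info; attach the event whose range contains the date
res <- date_info %>%
  left_join(events, by = join_by(between(date, start_date, end_date))) %>%
  mutate(in_range = !is.na(name))
```

If a date falls inside more than one event range, the join returns one row per matching event, so the result can have more rows than date_info.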
Here's a solution using map from the purrr package that should work. It could be more concise but I made it very explicit so it's not overwhelming if you're not familiar with the syntax.
date_info |>
mutate(
in_range_n = map_dbl(date, .f = function(date){
filter(events, start_date <= date, end_date >= date) |>
nrow()
}),
in_range = in_range_n > 0
) |>
select(-in_range_n)
Output:
# A tibble: 4 x 4
date statistic value in_range
<date> <chr> <dbl> <lgl>
1 2021-10-16 var1 10 FALSE
2 2021-10-16 var2 54 FALSE
3 2021-10-17 var1 23 TRUE
4 2021-10-17 var2 34 TRUE
Let me know if I misunderstood the problem!
Using base R
date_info$in_range <- sapply(date_info$date, function(date){
any(date >= events$start_date & date <= events$end_date)
})
gives
date statistic value in_range
<date> <chr> <dbl> <lgl>
1 2021-10-16 var1 10 FALSE
2 2021-10-16 var2 54 FALSE
3 2021-10-17 var1 23 TRUE
4 2021-10-17 var2 34 TRUE
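The date_range idea from the question can also be made to work in base R. This is only a sketch, and it assumes every start_date is on or before its end_date, so Event B's two dates are swapped here relative to the question (seq() would otherwise fail on that pair).

```r
# Assumption: Event B's dates are swapped so that start_date <= end_date
events <- data.frame(
  name = c("Event A", "Event B"),
  start_date = as.Date(c("2021-10-17", "2021-02-10")),
  end_date = as.Date(c("2021-10-19", "2021-02-19"))
)
date_info <- data.frame(
  date = as.Date(c("2021-10-16", "2021-10-16", "2021-10-17", "2021-10-17")),
  statistic = c("var1", "var2", "var1", "var2"),
  value = c(10, 54, 23, 34)
)

# Expand each start/end pair into its own daily sequence, then combine them
# with c() so the result stays a Date vector
date_range <- do.call(
  c,
  Map(function(s, e) seq(s, e, by = "day"), events$start_date, events$end_date)
)

date_info$in_range <- date_info$date %in% date_range
```

This materialises every day of every range, so for long ranges the comparison-based answers above are likely cheaper.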
Related
I am trying to join two data.frames based on a condition. Consider the following situation where I have df_a and df_b.
library(tidyverse)
# Dummy data A
df_a <- tibble(
id = c("a", "b", "c", "a"),
text = c("hi","why", "bye","cry"),
created_date = c(as.Date("2020-01-01"), as.Date("2020-02-02"), as.Date("2020-03-01"), as.Date("2020-04-04"))
)
# Dummy data B
df_b <- tibble(
id = c("a", "b", "c", "a"),
group = c("GROUP A","GROUP B","GROUP C", "GROUP C"),
start_date = c(as.Date("2020-01-01"), as.Date("2020-01-01"), as.Date("2020-01-01"), as.Date("2020-02-04"))
)
> df_a
# A tibble: 4 x 3
id text created_date
<chr> <chr> <date>
1 a hi 2020-01-01
2 b why 2020-02-02
3 c bye 2020-03-01
4 a cry 2020-04-04
> df_b
# A tibble: 4 x 3
id group start_date
<chr> <chr> <date>
1 a GROUP A 2020-01-01
2 b GROUP B 2020-01-01
3 c GROUP C 2020-01-01
4 a GROUP C 2020-02-04
# Current solution
d_current_sol <- df_a %>%
left_join(
df_b %>%
distinct(id, .keep_all = T), by = "id"
)
> d_current_sol
# A tibble: 4 x 5
id text created_date group start_date
<chr> <chr> <date> <chr> <date>
1 a hi 2020-01-01 GROUP A 2020-01-01
2 b why 2020-02-02 GROUP B 2020-01-01
3 c bye 2020-03-01 GROUP C 2020-01-01
4 a cry 2020-04-04 GROUP A 2020-01-01
# Desired solution
> d_desired
# A tibble: 4 x 5
id text created_date start_date group
<chr> <chr> <date> <date> <chr>
1 a hi 2020-01-01 2020-01-01 GROUP A
2 b why 2020-02-02 2020-01-01 GROUP B
3 c bye 2020-03-01 2020-01-01 GROUP C
4 a cry 2020-04-04 2020-02-04 GROUP C
As you can see in df_b the start date for id = a is first 2020-01-01 and then 2020-02-04.
What I would like is for any row in df_a with id = a and a created_date on or after 2020-02-04 to have "GROUP C" as its associated group. This is illustrated in the final data frame d_desired.
However, with a traditional left_join where we match the row to a single row item, we only get the earliest entry (d_current_sol). This would be easy enough to do using map() and using a custom function with mutate(), however, SQL doesn't support that when translating dplyr queries.
Does anyone know a way of doing this within the tidyverse paradigm? I will be using dbplyr to run this on a PostgreSQL database.
this?
df_a %>%
left_join(df_b, by = "id" ) %>%
filter( created_date >= start_date) %>%
group_by(id, created_date) %>%
top_n(1,start_date)
# A tibble: 4 x 5
# Groups: id, created_date [4]
id text created_date group start_date
<chr> <chr> <date> <chr> <date>
1 a hi 2020-01-01 GROUP A 2020-01-01
2 b why 2020-02-02 GROUP B 2020-01-01
3 c bye 2020-03-01 GROUP C 2020-01-01
4 a cry 2020-04-04 GROUP C 2020-02-04
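For in-memory data frames, dplyr 1.1.0 and later can express the same "latest start_date not after created_date" match as a rolling join. The sketch below assumes that dplyr version; whether dbplyr can translate closest() for PostgreSQL is not something this example checks, so treat it as a local alternative only.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for join_by() and closest()

df_a <- tibble(
  id = c("a", "b", "c", "a"),
  text = c("hi", "why", "bye", "cry"),
  created_date = as.Date(c("2020-01-01", "2020-02-02", "2020-03-01", "2020-04-04"))
)
df_b <- tibble(
  id = c("a", "b", "c", "a"),
  group = c("GROUP A", "GROUP B", "GROUP C", "GROUP C"),
  start_date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-01", "2020-02-04"))
)

# Match on id, then keep only the df_b row whose start_date is the closest
# one that is on or before created_date
res <- df_a %>%
  left_join(df_b, by = join_by(id, closest(created_date >= start_date)))
```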
I've got data from a number of surveys. Each survey can be sent multiple times with updated values. For each survey/row in the dataset there's a date when the survey was submitted (created). I'd like to merge the rows for each survey, keeping the date from the first submission but the other data from the last submission.
A simple example:
#> survey created var1 var2
#> 1 s1 2020-01-01 10 30
#> 2 s2 2020-01-02 10 90
#> 3 s2 2020-01-03 20 20
#> 4 s3 2020-01-01 45 5
#> 5 s3 2020-01-02 50 50
#> 6 s3 2020-01-03 30 10
Desired result:
#> survey created var1 var2
#> 1 s1 2020-01-01 10 30
#> 2 s2 2020-01-02 20 20
#> 3 s3 2020-01-01 30 10
Example data:
df <- data.frame(survey = c("s1", "s2", "s2", "s3", "s3", "s3"),
created = as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01", "2020-01-02", "2020-01-03"), format = "%Y-%m-%d", tz = "GMT"),
var1 = c(10, 10, 20, 45, 50, 30),
var2 = c(30, 90, 20, 5, 50, 10),
stringsAsFactors=FALSE)
I've tried group_by with summarize in different ways but can't make it work, any help would be highly appreciated!
After grouping by 'survey', set 'created' to the first (or min) value of 'created' and then slice the last row (n())
library(dplyr)
df %>%
group_by(survey) %>%
mutate(created = as.Date(first(created))) %>%
slice(n())
# A tibble: 3 x 4
# Groups: survey [3]
# survey created var1 var2
# <chr> <date> <dbl> <dbl>
#1 s1 2020-01-01 10 30
#2 s2 2020-01-02 20 20
#3 s3 2020-01-01 30 10
Or using base R (first() comes from dplyr, so a plain index is used here instead)
transform(df, created = ave(created, survey, FUN = function(x) x[1])
)[!duplicated(df$survey, fromLast = TRUE),]
After selecting the first created date we can select the last values from all the columns.
library(dplyr)
df %>%
group_by(survey) %>%
mutate(created = as.Date(first(created))) %>%
summarise(across(created:var2, last))
#In older version use `summarise_at`
#summarise_at(vars(created:var2), last)
# A tibble: 3 x 4
# survey created var1 var2
# <chr> <date> <dbl> <dbl>
#1 s1 2020-01-01 10 30
#2 s2 2020-01-02 20 20
#3 s3 2020-01-01 30 10
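The two answers above can also be collapsed into a single summarise() call. This is a sketch under the assumption that "first" and "last" mean earliest and latest by created, which is why the rows are sorted first; the name res is illustrative.

```r
library(dplyr)

df <- data.frame(
  survey = c("s1", "s2", "s2", "s3", "s3", "s3"),
  created = as.POSIXct(
    c("2020-01-01", "2020-01-02", "2020-01-03",
      "2020-01-01", "2020-01-02", "2020-01-03"),
    format = "%Y-%m-%d", tz = "GMT"
  ),
  var1 = c(10, 10, 20, 45, 50, 30),
  var2 = c(30, 90, 20, 5, 50, 10),
  stringsAsFactors = FALSE
)

res <- df %>%
  arrange(survey, created) %>%          # make first/last mean earliest/latest
  group_by(survey) %>%
  summarise(
    created = first(created),           # date of the first submission
    across(c(var1, var2), last)         # values from the last submission
  )
```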
I have a very large data set of time intervals (start and end date values), and need to calculate for every single day within the entire range of those dates how many of the intervals include the date.
In essence, I want to know how many people are in jail any given day. I have when they entered, and when they left. I need to be able to determine how many were in jail on every single day over a period of many years.
Example data:
require(tidyverse)
require(lubridate)
x <- tribble(~start, ~end,
today()-5, today()-3,
today()-4, today()-2,
today()-3, today()-1)
x <- x %>% mutate(dtint = interval(start, end))
x
#> # A tibble: 3 x 3
#> start end dtint
#> <date> <date> <Interval>
#> 1 2019-10-13 2019-10-15 2019-10-13 UTC--2019-10-15 UTC
#> 2 2019-10-14 2019-10-16 2019-10-14 UTC--2019-10-16 UTC
#> 3 2019-10-15 2019-10-17 2019-10-15 UTC--2019-10-17 UTC
mydays <- seq(min(x$start), max(x$end), by = "day") %>% enframe(name = NULL, value = "eachday")
mydays
#> # A tibble: 5 x 1
#> eachday
#> <date>
#> 1 2019-10-13
#> 2 2019-10-14
#> 3 2019-10-15
#> 4 2019-10-16
#> 5 2019-10-17
#Expected result:
mydays %>% add_column(expected_result = c(1, 2, 3, 2, 1))
#> # A tibble: 5 x 2
#> eachday expected_result
#> <date> <dbl>
#> 1 2019-10-13 1
#> 2 2019-10-14 2
#> 3 2019-10-15 3
#> 4 2019-10-16 2
#> 5 2019-10-17 1
Created on 2019-10-18 by the reprex package (v0.3.0)
I will also need the ability to do this on grouped tibbles so I can calculate things like totals by demographic properties.
Is there an efficient way to do this in tidyverse/lubridate?
Here's an approach of how you could track totals using the cumulative count across each grouped demographic.
# Example data
library(tidyverse)
set.seed(42)
x <- tibble(demographic = sample(LETTERS[1:3], 100, replace = T),
start = as.Date("2019-01-01") + runif(100, 0, 30),
end = start + runif(100, 1, 50))
## A tibble: 6 x 3
# demographic start end
# <chr> <date> <date>
#1 C 2019-01-19 2019-03-05
#2 C 2019-01-07 2019-02-02
#3 A 2019-01-07 2019-02-19
#4 C 2019-01-12 2019-02-04
#5 B 2019-01-29 2019-02-07
#6 B 2019-01-29 2019-02-21
First we bring it into long format. Then count each start date as incrementing up, each end date incrementing down. Then we group by demographic and take the cumulative total of those increments up and down.
x %>%
pivot_longer(-demographic, names_to = "col", values_to = "date") %>%
mutate(change = if_else(col == "start", 1, -1)) %>%
arrange(demographic, date) %>%
group_by(demographic) %>%
mutate(count = cumsum(change)) %>%
ungroup() %>%
ggplot(aes(date, count, color = demographic)) +
geom_step()
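The same increment/decrement idea can produce one count per calendar day in base R. The sketch below uses fixed dates that mirror the question's three-interval example (an assumption, since the question builds them from today()), and it treats the end day as still occupied by moving the decrement to the day after end.

```r
# Fixed dates standing in for the question's today()-based example
start <- as.Date("2019-10-13") + 0:2
end   <- as.Date("2019-10-15") + 0:2

days <- seq(min(start), max(end), by = "day")
n <- length(days)

# +1 on each start day; -1 on the day AFTER each end day (inclusive ends).
# One extra bin holds decrements that land just past the last day.
inc <- tabulate(as.integer(start - min(start)) + 1L, nbins = n + 1L)
dec <- tabulate(as.integer(end - min(start)) + 2L, nbins = n + 1L)

# Running total of arrivals minus departures, trimmed back to the real days
count <- cumsum(inc - dec)[seq_len(n)]
result <- data.frame(eachday = days, count = count)
```

Because it touches each interval once and each day once, this should scale better than comparing every day against every interval, though that has not been benchmarked here.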
One option involving dplyr, tidyr and lubridate could be:
x %>%
mutate(eachday = list(seq.Date(min(start), max(end), by = "1 day"))) %>%
unnest(eachday) %>%
group_by(eachday) %>%
summarise(overlap = sum(int_overlaps(dtint, interval(eachday, eachday))))
eachday overlap
<date> <int>
1 2019-10-13 1
2 2019-10-14 2
3 2019-10-15 3
4 2019-10-16 2
5 2019-10-17 1
First, it creates a list of dates between the first and last date in the data and unnest it. Then, it groups by the dates and sums the overlap between the dtint interval and the interval for dates.
Here is an option with data.table. Convert the 'data.frame' to a 'data.table' (setDT), create the sequence column from the min of 'start' to the max of 'end', then join with the original dataset using a non-equi join and get the count (.N).
library(data.table)
setDT(x)[x[, .(eachday = seq(min(start), max(end), by = '1 day'))],
.(eachday, overlap = .N), on = .(start <= eachday,
end >= eachday ), by = .EACHI][, .(eachday, overlap)]
# eachday overlap
#1: 2019-10-13 1
#2: 2019-10-14 2
#3: 2019-10-15 3
#4: 2019-10-16 2
#5: 2019-10-17 1
The dataframe df1 summarizes detections of individuals (ID) over time (Date). As a short example:
library(lubridate) # provides ymd()
df1<- data.frame(ID= c(1,2,1,2,1,2,1,2,1,2),
Date= ymd(c("2016-08-21","2016-08-24","2016-08-23","2016-08-29","2016-08-27","2016-09-02","2016-09-01","2016-09-09","2016-09-01","2016-09-10")))
df1
ID Date
1 1 2016-08-21
2 2 2016-08-24
3 1 2016-08-23
4 2 2016-08-29
5 1 2016-08-27
6 2 2016-09-02
7 1 2016-09-01
8 2 2016-09-09
9 1 2016-09-01
10 2 2016-09-10
I want to summarize both the number of days since the first detection of the individual (Ndays) and the number of distinct days on which the individual has been detected since the first detection (Ndifdays).
Additionally, I would like to include in this summary table a variable called Prop that simply divides Ndifdays between Ndays.
The summary table that I would expect would be this:
> Result
ID Ndays Ndifdays Prop
1 1 11 4 0.364 # Between 21st Aug and 1st Sept there are 11 days.
2 2 17 5 0.294 # Between 24th Aug and 10th Sept there are 17 days.
Does anyone know how to do it?
You could achieve this using various summarising functions in dplyr
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise(Ndays = as.integer(max(Date) - min(Date)),
Ndifdays = n_distinct(Date),
Prop = Ndifdays/Ndays)
# ID Ndays Ndifdays Prop
# <dbl> <int> <int> <dbl>
#1 1 11 4 0.364
#2 2 17 5 0.294
The data.table version of this would be
library(data.table)
df12 <- setDT(df1)[, .(Ndays = as.integer(max(Date) - min(Date)),
Ndifdays = uniqueN(Date)), by = ID]
df12$Prop <- df12$Ndifdays/df12$Ndays
and base R with aggregate
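df12 <- aggregate(Date~ID, df1, function(x) c(Ndays = as.integer(max(x) - min(x)), Ndifdays = length(unique(x))))
# aggregate() stores both results in a matrix column named Date
df12$Prop <- df12$Date[, "Ndifdays"]/df12$Date[, "Ndays"]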
After grouping by 'ID', take the diff of the range of 'Date' to create 'Ndays', count the unique 'Date' values with n_distinct to get 'Ndifdays', and divide 'Ndifdays' by 'Ndays' to get 'Prop'
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise(Ndays = as.integer(diff(range(Date))),
Ndifdays = n_distinct(Date),
Prop = Ndifdays/Ndays)
# A tibble: 2 x 4
# ID Ndays Ndifdays Prop
# <dbl> <int> <int> <dbl>
#1 1 11 4 0.364
#2 2 17 5 0.294
I have a table sf with two fields: Customer id and Buy_date. Buy_date values are unique within each customer, and a customer can have more than 3 different Buy_date values. I want to calculate the difference between consecutive Buy_date values for each customer, and the mean of those differences. How can I do this?
Example
Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13
I want the results for each customer in the format
Customer mean
Here's a dplyr solution.
Your data:
df <- data.frame(Customer = c(1,1,1,1,2,2,2), Buy_date = c("2018/03/01", "2018/03/19", "2018/04/3", "2018/05/10", "2018/01/02", "2018/02/10", "2018/04/13"))
Grouping, mean Buy_date calculation and summarising:
library(dplyr)
df %>% group_by(Customer) %>% mutate(mean = mean(as.POSIXct(Buy_date))) %>% group_by(Customer, mean) %>% summarise()
Output:
# A tibble: 2 x 2
# Groups: Customer [?]
Customer mean
<dbl> <dttm>
1 1 2018-03-31 06:30:00
2 2 2018-02-17 15:40:00
Or, as @r2evans points out in his comment, for the mean number of days between consecutive Buy_dates:
df %>% group_by(Customer) %>% mutate(mean = mean(diff(as.POSIXct(Buy_date)))) %>% group_by(Customer, mean) %>% summarise()
Output:
# A tibble: 2 x 2
# Groups: Customer [?]
Customer mean
<dbl> <time>
1 1 23.3194444444444
2 2 50.4791666666667
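The fractional values above are not whole-day averages because as.POSIXct() works in local clock time, so a daylight-saving change inside the period shifts a difference by an hour. A minimal sketch that avoids this by parsing to Date first is shown below; the explicit format string and the column name mean_days are choices made for this example.

```r
library(dplyr)

df <- data.frame(
  Customer = c(1, 1, 1, 1, 2, 2, 2),
  Buy_date = c("2018/03/01", "2018/03/19", "2018/04/3", "2018/05/10",
               "2018/01/02", "2018/02/10", "2018/04/13")
)

res <- df %>%
  # Date has no time-of-day or time zone, so differences are whole days
  mutate(Buy_date = as.Date(Buy_date, format = "%Y/%m/%d")) %>%
  arrange(Customer, Buy_date) %>%
  group_by(Customer) %>%
  summarise(mean_days = mean(as.numeric(diff(Buy_date))))
```

For customer 1 the gaps are 18, 15 and 37 days, and for customer 2 they are 39 and 62 days, so the means are 70/3 and 50.5 days.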
I am not exactly sure of the desired output, but this is what I think you want.
library(dplyr)
library(zoo)
dat <- read.table(text =
"Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13", header = T, stringsAsFactors = F)
dat$Buy_date <- as.Date(dat$Buy_date)
dat %>% group_by(Customer) %>% mutate(diff_between = as.vector(diff(zoo(Buy_date), na.pad=TRUE)),
mean_days = mean(diff_between, na.rm = TRUE))
This produces:
Customer Buy_date diff_between mean_days
<int> <date> <dbl> <dbl>
1 1 2018-03-01 NA 23.3
2 1 2018-03-19 18 23.3
3 1 2018-04-03 15 23.3
4 1 2018-05-10 37 23.3
5 2 2018-01-02 NA 50.5
6 2 2018-02-10 39 50.5
7 2 2018-04-13 62 50.5
EDITED BASED ON USER COMMENTS:
Because you said that you have factors and not characters just convert them by doing the following:
dat$Buy_date <- as.Date(as.character(dat$Buy_date))
dat$Customer <- as.character(dat$Customer)