I have a df with variables named as below
id indexDate eventDate1 eventDate2 V1 V2 V3 ....... V365
For the date range (eventDate1 - indexDate) to (eventDate2 - indexDate), I want to tag the days of occurrence in the V1 to V365 columns.
Each V represents the number of days post-indexDate.
For example:
If:
indexDate is 1/1/2017
eventDate1 is 1/3/2017 (= Day 2)
eventDate2 is 1/5/2017 (= Day 4),
then:
V2-V4 would be assigned a value of 1 and the rest of the V columns would be 0.
If there is a better way to do this, feel free to let me know!
Thanks.
This works:
library(dplyr)
library(tidyr)
# Make fake data
dates <- data.frame(id = 1:10,
                    indexDate = rep(as.Date("2017-01-01"), 10),
                    eventDate1 = as.Date(paste0("2017-01-", 1:10)),
                    eventDate2 = as.Date(paste0("2017-01-", 16:25)))
# Step through this to understand what's going on
dates[rep(row.names(dates), 365), ] %>%
  arrange(id) %>%
  mutate(Day = rep(1:365, nrow(dates)),
         Flag = ifelse(Day <= as.numeric(eventDate2 - indexDate) &
                       Day >= as.numeric(eventDate1 - indexDate), 1, 0)) %>%
  # move to wide format, one column per day (both endpoints inclusive, per the example)
  spread(Day, Flag)
I played with adding paste0("V", Day) as the key, but spread then ordered the columns alphabetically (V1, V10, V100, ...). With the plain numeric column names you can refer to the individual columns with backticks:
dates %>% select(`1`, `2`, `3`)
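If you do want V-prefixed columns in the right order, a hedged alternative is tidyr's newer pivot_wider(), which keeps columns in their order of appearance by default (names_sort = FALSE), so the prefix no longer scrambles the ordering; a sketch assuming a recent tidyr:
dates[rep(row.names(dates), 365), ] %>%
  arrange(id) %>%
  mutate(Day = rep(1:365, nrow(dates)),
         Flag = ifelse(Day <= as.numeric(eventDate2 - indexDate) &
                       Day >= as.numeric(eventDate1 - indexDate), 1, 0)) %>%
  # one column per day, named V1..V365 in order of appearance
  pivot_wider(names_from = Day, values_from = Flag, names_prefix = "V")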
I'm trying to create a new variable which equals the latest month's value minus the previous month's (or 3 months prior, etc.).
A quick df:
country <- c("XYZ", "XYZ", "XYZ")
my_dates <- c("2021-10-01", "2021-09-01", "2021-08-01")
var1 <- c(1, 2, 3)
df1 <- data.frame(country, my_dates, var1)
df1$my_dates <- as.Date(df1$my_dates)
For example, I've tried (partially from: How to subtract months from a date in R?)
library(tidyverse)
library(lubridate) # needed for the %m-% operator
df2 <- df1 %>%
  mutate(dif_1month = var1[my_dates == max(my_dates)] -
           var1[my_dates == max(my_dates) %m-% months(1)])
I've also tried different variations of using lag():
df2 <- df1 %>%
  mutate(dif_1month = var1[my_dates == max(my_dates)] -
           var1[my_dates == max(my_dates) - lag(max(my_dates), n = 1L)])
Any suggestions on how to grab the value of a variable when dates equal the second latest observation?
Thanks for help, and apologies for not including any data. Can edit if necessary.
Edited with a few potential answers:
# this gives me the value of var1 at the latest date
df2 <- df1 %>%
  mutate(value_1month = var1[my_dates == max(my_dates)])
# this gives me the second latest date
df2 <- df1 %>%
  mutate(month1 = max(my_dates) %m-% months(1))
# this gives me the second latest value
df2 <- df1 %>%
  mutate(var1_1month = var1[my_dates == max(my_dates) %m-% months(1)])
# this gives me the difference between the latest and second latest values of var1
df2 <- df1 %>%
  mutate(diff_1month = var1[my_dates == max(my_dates)] -
           var1[my_dates == max(my_dates) %m-% months(1)])
mutate() requires the output to have the same length as the number of rows of the original data. When we subset, the length differs, so we may need ifelse() or case_when():
library(dplyr)
library(lubridate)
df1 %>%
  mutate(diff_1month = case_when(my_dates == max(my_dates) ~
                                   my_dates %m-% months(1)))
NOTE: Without a reproducible example, the column types and values are unclear.
Based on the OP's update, we may do an arrange first, grab the last two 'var1' values, and take the difference:
df1 %>%
  arrange(my_dates) %>%
  mutate(dif_1month = diff(tail(var1, 2)))
  country   my_dates var1 dif_1month
1     XYZ 2021-08-01    3         -1
2     XYZ 2021-09-01    2         -1
3     XYZ 2021-10-01    1         -1
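A more general sketch (my own suggestion, not from the answers above): after sorting by date, lag() gives each row its previous month's value, so the difference is available on every row rather than as a single constant; grouping by country is an assumption about the fuller data.
library(dplyr)
df1 %>%
  group_by(country) %>%
  arrange(my_dates, .by_group = TRUE) %>%
  # each row's value minus the previous month's value within the country
  mutate(dif_1month = var1 - lag(var1)) %>%
  ungroup()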
I am looking to combine overlapping and non-overlapping periods with the 'lubridate' and 'dplyr' packages (or any others you can advise). Here is an example data frame:
library(dplyr)
library(lubridate)
vacation_start <- as_date('2017-04-19')
vacation_end <- as_date('2017-04-25')
course_start <- as_date('2017-04-12')
course_end <- as_date('2017-04-21')
course_interval <- interval(course_start, course_end)
vacation_interval <- interval(vacation_start, vacation_end)
df <- data.frame(id = "ID", part = c("A", "B"),
start = c(course_start,vacation_start),
end = c(course_end, vacation_end),
interval = c(course_interval, vacation_interval))
The data frame looks like this:
  id part      start        end                        interval
1 ID    A 2017-04-12 2017-04-21 2017-04-12 UTC--2017-04-21 UTC
2 ID    B 2017-04-19 2017-04-25 2017-04-19 UTC--2017-04-25 UTC
I would like to combine them into overlapping and non-overlapping periods like this, grouped by 'ID' and 'part':
  id part      start        end                        interval
1 ID    A 2017-04-12 2017-04-18 2017-04-12 UTC--2017-04-18 UTC
2 ID  A,B 2017-04-19 2017-04-21 2017-04-19 UTC--2017-04-21 UTC
3 ID    B 2017-04-22 2017-04-25 2017-04-22 UTC--2017-04-25 UTC
I have tried to generate the middle row with the overlapping period, but I am unable to keep the non-overlapping periods with the 'dplyr' package:
df_2 <- df %>%
  group_by(id) %>%
  summarise(part = paste(part, collapse = ','),
            start = max(start),
            end = min(end),
            interval = start %--% end)
Any ideas or solutions are much appreciated. Thanks!
My first answer assumes only two overlapping periods, which means each comparison needs just a single join. Repeating that process for more than two time periods requires ever more joins and quickly becomes an inefficient mess.
To handle joining an arbitrary (or unknown) number of overlaps we need a very different method, hence I am providing this as a separate answer.
Step 1: Obtain a list of all possible start and end dates
all_start = df %>%
  select(id, start)
all_end = df %>%
  select(id, start = end)
all_start_and_end = rbind(all_start, all_end) %>%
  distinct()
Step 2: Create a list of all possible periods
all_periods = all_start_and_end %>%
  group_by(id) %>%
  mutate(end = lead(start, 1, order_by = start))
Step 3: Overlap original data with all periods and summarise
overlapped = all_periods %>%
  left_join(df, by = "id", suffix = c("_1", "_2")) %>%
  filter(start_1 <= end_2,
         start_2 <= end_1) %>%
  # part comes through unsuffixed because only start and end clash in the join
  select(id, part, start = start_1, end = end_1) %>%
  group_by(id, start, end) %>%
  summarise(part = toString(part))
Depending on your exact data and situation:
You may want to change "<=" to "<" or subtract 1 day from end dates to ensure periods do not overlap. This depends on how you are handling the boundary conditions of your time periods.
You may want to remove the distinct in step 1 to allow for periods that are only a single day long.
In step 1 you can add a very early date (e.g. 0000-01-01) and a very late date (e.g. 9999-12-31) if you want the output to include all the time periods with part = NA.
Once step three completes you may want to filter out any periods with part = NA.
Depending on your input data you may observe adjacent output periods with the same part, e.g. in row 1 part A ends 2020-01-01 and in row 2 part A starts 2020-01-02. Take a look at the gaps-and-islands tag for ways to solve this problem; a sketch of one approach follows below.
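A minimal gaps-and-islands sketch (my own illustration, assuming the overlapped result from step 3; not part of the original answer): a new "island" starts whenever a row does not begin the day after the previous row of the same part ended, and each island is then collapsed into one period.
library(dplyr)
overlapped %>%
  group_by(id, part) %>%
  arrange(start, .by_group = TRUE) %>%
  # TRUE (a new island) whenever there is a gap of more than one day
  mutate(island = cumsum(start > lag(end, default = first(start)) + 1)) %>%
  group_by(id, part, island) %>%
  summarise(start = min(start), end = max(end), .groups = "drop")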
I would recommend creating the overlaps and non-overlaps separately. This is often necessary if you want the number of output rows to be greater than the number of input rows.
For the overlaps I would do something like:
overlap_df = df %>%
  inner_join(df, by = "id", suffix = c("_1", "_2")) %>%
  filter(part_1 < part_2,
         start_1 <= end_2,
         start_2 <= end_1) %>%
  mutate(part = paste0(part_1, ",", part_2),  # new part label
         start = pmax(start_1, start_2),      # latest start date (pmax keeps the Date class, unlike ifelse)
         end = pmin(end_1, end_2)) %>%        # earliest end date
  select(id, part, start, end)
The first filter condition ensures that you only have one order for each overlap (e.g. only A,B and not also B,A). The second and third filter conditions ensure that the time periods overlap.
For the non-overlaps I would distinguish between never-overlapped (periods that do not have any overlap with another period) and not-overlapped (the parts of periods that are not overlapped).
For the never-overlapped I would do something like:
never_overlapped_df = df %>%
  left_join(df, by = "id", suffix = c("_1", "_2")) %>%
  filter(part_1 != part_2) %>%
  mutate(overlap = ifelse(start_1 <= end_2 & start_2 <= end_1, 1, 0)) %>%
  group_by(id, part_1, start_1, end_1) %>%
  summarise(num = sum(overlap, na.rm = TRUE)) %>%
  filter(is.na(num) | num == 0) %>%
  select(id, part = part_1, start = start_1, end = end_1)
The idea is to find and count all the overlaps and then keep only the records without any overlaps.
For the non-overlapped I would do something like:
non_overlapped_df = df %>%
  inner_join(df, by = "id", suffix = c("_1", "_2")) %>%
  filter(part_1 != part_2,
         start_1 <= end_2,
         start_2 <= end_1) %>% # parts are different and periods overlap
  # discard start_2 & end_2 that are not within start_1 and end_1
  # (if_else rather than ifelse so the Date class is preserved)
  mutate(start_2 = if_else(start_1 <= start_2 & start_2 <= end_1, start_2, as.Date(NA)),
         end_2 = if_else(start_1 <= end_2 & end_2 <= end_1, end_2, as.Date(NA))) %>%
  group_by(id, part_1, start_1, end_1) %>%
  summarise(min_start_2 = min(start_2, na.rm = TRUE),
            max_end_2 = max(end_2, na.rm = TRUE)) %>%
  mutate(start = if_else(is.na(max_end_2), start_1, max_end_2),
         end = if_else(is.na(min_start_2), end_1, min_start_2)) %>%
  select(id, part = part_1, start, end)
You can then combine these together with rbind:
output_df = rbind(overlap_df, never_overlapped_df, non_overlapped_df)
Note that I have assumed a maximum of one overlap at a time (e.g. part = "A,B,C" does not happen). This simplifies the problem; solving the general case for arbitrary numbers of overlaps is much more complex and requires a different approach (see my other answer above).
Note you may also want to change "<=" to "<" or subtract 1 day from end dates to ensure periods do not overlap, depending on how you are handling the boundary conditions of your time periods. A sketch of the subtract-one-day variant follows.
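A sketch of that variant (my own illustration, not from the answer): treat end dates as exclusive by trimming one day off before running the pipelines above, so periods that merely touch no longer count as overlapping.
library(dplyr)
df_adjusted <- df %>%
  mutate(end = end - 1) # Date arithmetic: shifts each end date back one day
# ...then build overlap_df, never_overlapped_df and non_overlapped_df from df_adjusted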
I have a data frame with COVID data and I'm trying to make a column calculating the number of recovered people based off of the number of positive tests.
My data has a location, a date, and the number of tests administered/positive results/negative results each day. Here's a few lines using one location as an example (the real data has several months worth of dates):
loc date tests pos neg active
spot1 2020-04-10 1 1 0 5
spot1 2020-04-11 2 1 1 6
spot1 2020-04-12 0 0 0 6
spot1 2020-04-13 11 1 10 7
I want to make a new column that cumulatively counts each positive test in each location 14 days after it is recorded. On 2020-04-24, the 5 active cases are no longer active, so I want a recovered column with 5. For each subsequent date I want the newly inactive cases added.
My first thought was to try it in a loop:
df1 <- df %>%
  mutate(date = as.Date(date)) %>%
  group_by(loc) %>%
  mutate(rec = for (i in 1:nrow(df)) {
    # getting number of new cases
    x <- df$pos[i]
    # add 14 days to the date
    d <- df$date + 14
    df$rec <- sum(x)
  })
As you can see, I'm not the best at writing for loops. That gives me a bunch of numbers that bear very little relationship to the data.
Also tried it with map_dbl:
df1 <- df %>%
  mutate(date = as.Date(date)) %>%
  group_by(loc) %>%
  mutate(rec = map_dbl(date, ~ sum(pos[(date <= . + 14) & date >= .])))
That resulted in the same number repeated down the entire rec column.
Any suggestions? (Sorry for the lengthy explanation, just want to make sure this all makes sense)
Your sample data shows that:
you have continuous dates even where there were 0 tests (12 April)
the active column already appears to be a cumulative sum
Therefore I think you can simply use the lag function with argument 14.
Example code:
df %>% group_by(loc) %>% mutate(recovered = lag(active, 14)) %>% ungroup()
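If active were not already cumulative, a sketch of the same idea built from the daily positives instead (my own illustration; it assumes one row per day per loc and that only positives present in the data count):
library(dplyr)
df %>%
  mutate(date = as.Date(date)) %>%
  group_by(loc) %>%
  arrange(date, .by_group = TRUE) %>%
  # cumulative positives, shifted 14 days so each case counts as recovered after two weeks
  mutate(recovered = lag(cumsum(pos), 14, default = 0)) %>%
  ungroup()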
You could use aggregate to sum the specific column, applying
cut in order to set a 14-day time frame for each sum:
df <- data.frame(loc = rep("spot1", 30),
                 date = seq(as.Date('2020-04-01'), as.Date('2020-04-30'), by = 1),
                 test = 1:30,
                 positive = 1:30,
                 active = 1:30)
output <- aggregate(positive ~ cut(date, "14 days"), df, sum)
output
Console output:
cut(date, "14 days") positive
1 2020-04-01 105
2 2020-04-15 301
3 2020-04-29 59
My solution:
library(dplyr)
date_seq <- seq(as.Date("2020/04/01"), by = "day", length.out = 30)
pos <- rpois(n = 60, lambda = 10)
mydf <-
data.frame(loc = c(rep('loc1', 30), rep('loc2', 30)),
date = date_seq,
pos = pos)
head(mydf)
getPosSum <- function(max, tbl, myloc, daysBack = 14) {
max.Date <- as.Date(max)
sum(tbl %>%
filter(date >= max.Date - (daysBack - 1) &
date <= max.Date & loc == myloc) %>%
select(pos))
}
result <-
mydf %>%
group_by(date, loc) %>%
mutate(rec = getPosSum(max = date, tbl = mydf, myloc = loc))
library(tidyverse)
library(lubridate)
data %>%
  mutate(date = as_date(date),
         cut = cut(date, '14 days')) %>%
  group_by(loc) %>%
  arrange(cut) %>%
  mutate(cum_pos = accumulate(pos, `+`)) # accumulate(pos, sum) should also work
As a general rule of thumb, avoid loops, especially within mutate; that won't work. Instead of map_dbl you should check out purrr::accumulate. There are specialized functions for this in R's base library, such as cumsum and cummin, but their behavior is less predictable than purrr's.
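For completeness, a hedged sketch of the question's map_dbl idea, adjusted so each row counts the positives recorded at least 14 days before its own date (an assumption about the intended "recovered" definition, not from the answers above):
library(dplyr)
library(purrr)
df %>%
  mutate(date = as.Date(date)) %>%
  group_by(loc) %>%
  # for each row's date .x, sum the positives recorded on or before .x - 14
  mutate(rec = map_dbl(date, ~ sum(pos[date <= .x - 14]))) %>%
  ungroup()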
I will describe my problem via the code below. I have a 'compressed' time series in data frame have: it contains the start date, end date, and a value for each period. I want to expand it as in data frame want, to ultimately get to data frame ultimately_want, which sums the value by date. Maybe I do not need want and can get straight to ultimately_want somehow? Thanks.
library(dplyr)
start_date <- as.Date(c("2004-08-02", "2004-08-03"))
end_date <- as.Date(c("2004-08-04", "2004-08-05"))
value <- c(5, 6)
have <- data.frame(start_date, end_date, value)
have
date <- as.Date(c("2004-08-02", "2004-08-03", "2004-08-04", "2004-08-03", "2004-08-04", "2004-08-05"))
value <- c(5, 5, 5, 6, 6, 6)
want <- data.frame(date, value)
want
ultimately_want <- want %>%
group_by(date) %>%
summarise(total = sum(value))
ultimately_want
Here is a data.table approach:
library(data.table)
setDT(have)[, .(value = value, date = seq(start_date, end_date, by = "day")),
            by = 1:nrow(have)][, .(total = sum(value)), date][]
# date total
#1: 2004-08-02 5
#2: 2004-08-03 11
#3: 2004-08-04 11
#4: 2004-08-05 6
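Since the question mentions dplyr, here is a hedged tidyverse equivalent of the same idea (my own sketch, not from the answer above): expand each row into one row per day, then sum by date.
library(dplyr)
library(tidyr)
# using the original `have` data frame, before setDT()
have %>%
  rowwise() %>%
  mutate(date = list(seq(start_date, end_date, by = "day"))) %>%
  ungroup() %>%
  unnest(date) %>%
  group_by(date) %>%
  summarise(total = sum(value))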
How do I add one column price.wk.average to the data such that it equals the average price of the week, and one column price.mo.average such that it equals the average price of the month? The price.wk.average will be the same for the entire week.
Dates Price Demand Price.wk.average Price.mo.average
2010-1-1 x x
2010-1-2 x x
......
2015-1-1 x x
jkl,
try to post reproducible examples; it will make it easier to help you. You can use dplyr (together with lubridate for week() and month()):
library(dplyr)
library(lubridate) # for week() and month()
df <- data.frame(date = seq(as.Date("2017-1-1"), by = "day", length.out = 100),
                 price = round(runif(100) * 100 + 50, 0))
df <- df %>%
  group_by(week = week(date)) %>%
  mutate(Price.wk.average = mean(price)) %>%
  ungroup() %>%
  group_by(month = month(date)) %>%
  mutate(Price.mo.average = mean(price))
(Since I don't have enough points to comment)
I wanted to point out that Eric's answer will not distinguish average weekly price by year. Therefore, if you are interested in unique weeks (week 1 of 2012 != week 1 of 2015), you will need to do extra work to group by unique weeks.
df <- data.frame(Dates = c("2010-1-1", "2010-1-2", "2015-01-3"),
                 Price = c(50, 20, 40))
Dates Price
1 2010-1-1 50
2 2010-1-2 20
3 2015-01-3 40
Just to keep your data frame tidy, I suggest converting the dates to POSIXct format and then sorting the data frame:
library(dplyr)
library(lubridate)
df <- df %>%
  mutate(Dates = lubridate::parse_date_time(Dates, "ymd")) %>%
  arrange(Dates)
To group by unique weeks:
df <- df %>%
  group_by(yw = paste(year(Dates), week(Dates)))
Then mutate and ungroup.
To group by unique months:
df <- df %>%
  group_by(ym = paste(year(Dates), month(Dates)))
and mutate and ungroup. A full end-to-end sketch follows below.
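A hedged sketch putting the steps above together (column names as in the question; the yw/ym helper columns are my own):
library(dplyr)
library(lubridate)
df <- df %>%
  group_by(yw = paste(year(Dates), week(Dates))) %>%   # unique year-week
  mutate(Price.wk.average = mean(Price)) %>%
  ungroup() %>%
  group_by(ym = paste(year(Dates), month(Dates))) %>%  # unique year-month
  mutate(Price.mo.average = mean(Price)) %>%
  ungroup()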