I have a dataset of observations with start and end dates. I would like to calculate the moving average difference between the start and end dates.
I've included an example dataset below.
require(dplyr)
df <- data.frame(id=c(1,2,3),
start=c("2019-01-01","2019-01-10", "2019-01-05"),
end=c("2019-02-01", "2019-01-15", "2019-01-10"))
df[,c("start", "end")] <- lapply(df[,c("start", "end")], as.Date)
id start end
1 2019-01-01 2019-02-01
2 2019-01-10 2019-01-15
3 2019-01-05 2019-01-10
The overall date range is 2019-01-01 to 2019-02-01. For each date in that range, I would like the average difference between that date and the start dates of the observations whose start-end window contains it. For example, on 2019-01-11 ids 1 and 2 are active, with differences of 10 days and 1 day, so the average is 5.5.
The result would look exactly like this. I've included the actual values for the averages that should show up:
date avg
2019-01-01 0
2019-01-02 1
2019-01-03 2
2019-01-04 3
2019-01-05 4
2019-01-06 3
2019-01-07 4
2019-01-08 5
2019-01-09 6
2019-01-10 7
2019-01-11 5.5
. .
. .
. .
Creating a reproducible example:
df <- data.frame(id=c(1,2,3,4),
start=c("2019-01-01","2019-01-01", "2019-01-10", "2019-01-05"),
end=c("2019-01-04", "2019-01-05", "2019-01-12", "2019-01-08"))
df[,c("start", "end")] <- lapply(df[,c("start", "end")], as.Date)
df
Returns:
id start end
1 2019-01-01 2019-01-04
2 2019-01-01 2019-01-05
3 2019-01-10 2019-01-12
4 2019-01-05 2019-01-08
Then using the group_by function from dplyr:
library(dplyr)
df %>%
group_by(start) %>%
summarize(avg=mean(end - start)) %>%
rename(date=start)
Returns:
date avg
<time> <time>
2019-01-01 3.5 days
2019-01-05 3.0 days
2019-01-10 2.0 days
Editing the answer as per the comments: the above only averages by unique start date, not for every date in the overall range.
Creating the df:
require(dplyr)
df <- data.frame(id=c(1,2,3),
start=c("2019-01-01", "2019-01-10", "2019-01-05"),
end=c("2019-02-01", "2019-01-15", "2019-01-10"))
df[,c("start", "end")] <- lapply(df[,c("start", "end")], as.Date)
Create a row for every date within each start-end window:
# gives the list of all dates within each start-end window and calculates the difference
datesList = lapply(1:nrow(df), function(i) {
  tibble('date' = seq.Date(from = df$start[i], to = df$end[i], by = 1),
         'start' = df$start[i]) %>%
    dplyr::mutate(diff = date - start)
})
Finally, group_by the date and find the average (note that the filter below drops the zero-day rows, so the output starts at 2019-01-02 rather than 2019-01-01):
finalDf = bind_rows(datesList) %>%
  dplyr::filter(diff != 0) %>%     # drop the zero-day difference at each start date
  dplyr::group_by(date) %>%
  dplyr::summarise(avg = mean(diff, na.rm = TRUE))
The output thus becomes:
# A tibble: 31 x 2
date avg
<date> <time>
1 2019-01-02 1.0 days
2 2019-01-03 2.0 days
3 2019-01-04 3.0 days
4 2019-01-05 4.0 days
5 2019-01-06 3.0 days
6 2019-01-07 4.0 days
7 2019-01-08 5.0 days
8 2019-01-09 6.0 days
9 2019-01-10 7.0 days
10 2019-01-11 5.5 days
# … with 21 more rows
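If plain numbers are preferred over the difftime display, the avg column can be converted afterwards; one optional extra step on top of the answer above:
# optional: turn the difftime averages into plain numbers of days
finalDf = finalDf %>%
  dplyr::mutate(avg = as.numeric(avg, units = "days"))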
Let me know if it works.
I have a dataset with dates in tibble format from tidyverse/dplyr.
library(tidyverse)
A = seq(from = as.Date("2019/1/1"), to = as.Date("2022/1/1"), length.out = 252*3)
length(A)
x = rnorm(252*3)
d = tibble(A,x);d
Resulting in:
# A tibble: 756 x 2
A x
<date> <dbl>
1 2019-01-01 1.43
2 2019-01-02 0.899
3 2019-01-03 0.658
4 2019-01-05 -0.0720
5 2019-01-06 -1.99
6 2019-01-08 -0.743
7 2019-01-09 0.426
8 2019-01-11 0.00675
9 2019-01-12 0.967
10 2019-01-14 -0.606
# ... with 746 more rows
I also have a date of interest, say:
start = as.Date("2021/12/15");start
I want to subset the dataset from this specific date (start) going one year back, but a year here has 252 observations.
I tried:
d %>%
  dplyr::filter(A < start) %>%
  dplyr::slice_tail(n = 252)
but I don't like it because my real dataset has more than one factor label, and with this approach I get 252 observations in total instead of 252 per label.
I also tried:
LAST_YEAR = start - 365
d %>%
  dplyr::filter(A <= start & A >= LAST_YEAR)
which works, but I want to use the 252. Imagine that I want to go two years (252 * 2 observations) back and find how many observations I have in that specific time interval.
Any help on how I can do that?
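One way to keep 252 observations per group rather than 252 in total is to group before slicing. A minimal sketch, assuming the real data has a grouping column (called label below; the name is hypothetical):
n_back = 252   # use 252 * 2 to go two years back
d %>%
  dplyr::group_by(label) %>%      # 'label' stands in for the real factor column
  dplyr::filter(A < start) %>%
  dplyr::slice_tail(n = n_back) %>%
  dplyr::ungroup()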
This seems like it should be straightforward, but I cannot find a way to do this.
I have a sales cycle that begins ~ August 1 of each year and need to sum sales by week number. I need to create a "week number" field where week #1 begins on a date that I specify. Thus far I have looked at lubridate, base R, and strftime, and I cannot find a way to change the "start" date from 01/01/YYYY to something else.
Solution needs to let me specify the start date and iterate week numbers as 7 days from the start date. The actual start date doesn't always occur on a Sunday or Monday.
Example data frame:
eg_data <- data.frame(
cycle = c("cycle2019", "cycle2019", "cycle2018", "cycle2018", "cycle2017", "cycle2017", "cycle2016", "cycle2016"),
dates = as.POSIXct(c("2019-08-01" , "2019-08-10" ,"2018-07-31" , "2018-08-16", "2017-08-03" , "2017-08-14" , "2016-08-05", "2016-08-29")),
week_n = c("1", "2","1","3","1","2","1","4"))
I'd like the result to look like what is above - it would take the min date for each cycle and use that as a starting point, then iterate up week numbers based on a given date's distance from the cycle starting date.
This almost works. (Doing date arithmetic gives us durations in seconds: there may be a smoother way to convert with lubridate tools?)
secs_per_week <- 60*60*24*7
(eg_data
%>% group_by(cycle)
%>% mutate(nw=1+as.numeric(round((dates-min(dates))/secs_per_week)))
)
The results don't match for 2017, because there is an 11-day gap between the first and second observation ...
cycle dates week_n nw
<chr> <dttm> <chr> <dbl>
5 cycle2017 2017-08-03 00:00:00 1 1
6 cycle2017 2017-08-14 00:00:00 2 3
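Replacing round with floor in the snippet above lines the 2017 rows up with the intended numbering, since a partial week should count toward the current week rather than the next one; this is a tweak on the code above, not part of the original attempt:
(eg_data
  %>% group_by(cycle)
  %>% mutate(nw = 1 + as.numeric(floor((dates - min(dates)) / secs_per_week)))
)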
If someone has a better answer, please post it, but this works.
Take the data frame in the example, eg_data:
eg_data %>%
  group_by(cycle) %>%
  mutate(
    cycle_start = as.Date(min(dates)),
    days_diff = as.Date(dates) - cycle_start,
    week_n = days_diff / 7,
    # floor + 1 so that the cycle's first date falls in week 1
    week_n_whole = floor(as.numeric(days_diff) / 7) + 1) -> eg_data_check
(First time I've answered my own question)
library("lubridate")
eg_data %>%
as_tibble() %>%
group_by(cycle) %>%
mutate(new_week = week(dates)-31)
This doesn't quite work the same as your example, but perhaps with some fiddling based on your domain experience you could adapt it:
library(lubridate)
library(stringr)  # for str_sub
eg_data %>%
  mutate(aug1 = ymd_h(paste(str_sub(cycle, start = -4), "080100")),
         week_n2 = ceiling((dates - aug1)/ddays(7)))
EDIT: If you have specific known dates for the start of each cycle, it might be helpful to join those dates to your data for the calc:
library(lubridate)
cycle_starts <- data.frame(
cycle = c("cycle2019", "cycle2018", "cycle2017", "cycle2016"),
start_date = ymd_h(c(2019080100, 2018072500, 2017080500, 2016071300))
)
eg_data %>%
left_join(cycle_starts) %>%
mutate(week_n2 = ceiling((dates - start_date)/ddays(7)))
#Joining, by = "cycle"
# cycle dates week_n start_date week_n2
#1 cycle2019 2019-08-01 1 2019-08-01 1
#2 cycle2019 2019-08-10 2 2019-08-01 2
#3 cycle2018 2018-07-31 1 2018-07-25 1
#4 cycle2018 2018-08-16 3 2018-07-25 4
#5 cycle2017 2017-08-03 1 2017-08-05 0
#6 cycle2017 2017-08-14 2 2017-08-05 2
#7 cycle2016 2016-08-05 1 2016-07-13 4
#8 cycle2016 2016-08-29 4 2016-07-13 7
This is a concise solution using lubridate:
library(lubridate)
eg_data %>%
group_by(cycle) %>%
mutate(new_week = floor(as.period(ymd(dates) - ymd(min(dates))) / weeks()) + 1)
# A tibble: 8 x 4
# Groups: cycle [4]
cycle dates week_n new_week
<chr> <dttm> <chr> <dbl>
1 cycle2019 2019-08-01 00:00:00 1 1
2 cycle2019 2019-08-10 00:00:00 2 2
3 cycle2018 2018-07-31 00:00:00 1 1
4 cycle2018 2018-08-16 00:00:00 3 3
5 cycle2017 2017-08-03 00:00:00 1 1
6 cycle2017 2017-08-14 00:00:00 2 2
7 cycle2016 2016-08-05 00:00:00 1 1
8 cycle2016 2016-08-29 00:00:00 4 4
I have a data frame with 10,000+ dates. For example:
indexdt
01-02-2019
08-15-2019
I need to create two data frames based on the following conditions:
1) Generate dates such that I get the same day of week, up to 3 weeks before and after the index date. The output should be:
Table 1
indexdt dates
01-02-2019 12-26-2018
01-02-2019 12-19-2018
01-02-2019 12-12-2018
01-02-2019 01-09-2019
01-02-2019 01-16-2019
01-02-2019 01-23-2019
08-15-2019 07-25-2019
08-15-2019 08-01-2019
08-15-2019 08-08-2019
08-15-2019 08-22-2019
08-15-2019 08-29-2019
08-15-2019 09-05-2019
2) Same day of week, same month. The output should be:
Table 2
indexdt date
01-02-2019 01-09-2019
01-02-2019 01-16-2019
01-02-2019 01-23-2019
01-02-2019 01-30-2019
08-15-2019 08-01-2019
08-15-2019 08-08-2019
08-15-2019 08-22-2019
08-15-2019 08-29-2019
I have answered both questions here, but you should only ask one question per post:
library(dplyr)
library(purrr)
library(tidyr)     # unnest() comes from tidyr
library(lubridate)
#Convert to date
df <- df %>% mutate(indexdt = mdy(indexdt))
1) Generate dates such that I get the same day of week, up to 3 weeks before and after the index date
We use seq to generate the before and after dates separately. [-1] drops the indexdt date itself, since we don't want it in the final output.
df %>%
mutate(dates = map(indexdt, ~c(seq(.x, length.out = 4, by = -7)[-1],
seq(.x, length.out = 4, by = 7)[-1]))) %>%
unnest(dates)
# indexdt dates
# <date> <date>
# 1 2019-01-02 2018-12-26
# 2 2019-01-02 2018-12-19
# 3 2019-01-02 2018-12-12
# 4 2019-01-02 2019-01-09
# 5 2019-01-02 2019-01-16
# 6 2019-01-02 2019-01-23
# 7 2019-08-15 2019-08-08
# 8 2019-08-15 2019-08-01
# 9 2019-08-15 2019-07-25
#10 2019-08-15 2019-08-22
#11 2019-08-15 2019-08-29
#12 2019-08-15 2019-09-05
2) Same day of week, same month.
Here we create a sequence from indexdt date to start of the month (floor_date) and another sequence from indexdt to end of the month (ceiling_date - 1).
df %>%
mutate(dates = map(indexdt, ~c(seq(.x, floor_date(.x, 'month'), by = -7)[-1],
seq(.x, ceiling_date(.x, 'month') - 1, by = 7)[-1]))) %>%
unnest(dates)
# indexdt dates
# <date> <date>
#1 2019-01-02 2019-01-09
#2 2019-01-02 2019-01-16
#3 2019-01-02 2019-01-23
#4 2019-01-02 2019-01-30
#5 2019-08-15 2019-08-08
#6 2019-08-15 2019-08-01
#7 2019-08-15 2019-08-22
#8 2019-08-15 2019-08-29
data
df <- structure(list(indexdt = c("01-02-2019", "08-15-2019")),
class = "data.frame", row.names = c(NA, -2L))
I'm getting started with R, so please bear with me.
For example, I have this data.table (or data.frame) object:
Time Station count_starts count_ends
01/01/2015 00:30 A 2 3
01/01/2015 00:40 A 2 1
01/01/2015 00:55 B 1 1
01/01/2015 01:17 A 3 1
01/01/2015 01:37 A 1 1
My end goal is to group the Time column by hour and sum count_starts and count_ends for each hourly time and station:
Time Station sum(count_starts) sum(count_ends)
01/01/2015 01:00 A 4 4
01/01/2015 01:00 B 1 1
01/01/2015 02:00 A 4 2
I did some research and found out that I should use the xts library.
Thanks for helping me out
UPDATE:
I converted transactions$Time to POSIXct, so the xts package should be able to use the time series directly.
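For reference, that conversion can be done in one line, assuming the mm/dd/yyyy HH:MM format shown in the sample (transactions is the table name used in this update):
# parse the Time strings into POSIXct, format assumed from the sample data
transactions$Time <- as.POSIXct(transactions$Time, format = "%m/%d/%Y %H:%M")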
Using base R, we can still do the above; the only difference is that each hour is labelled by its start, so every label is one hour earlier than in the expected output:
dat=read.table(text = "Time Station count_starts count_ends
'01/01/2015 00:30' A 2 3
'01/01/2015 00:40' A 2 1
'01/01/2015 00:55' B 1 1
'01/01/2015 01:17' A 3 1
'01/01/2015 01:37' A 1 1",
header = TRUE, stringsAsFactors = FALSE)
# cut() floors each timestamp to the start of its hour
dat$Time = cut(strptime(dat$Time, "%m/%d/%Y %H:%M"), "hour")
aggregate(. ~ Time + Station, dat, sum)
Time Station count_starts count_ends
1 2015-01-01 00:00:00 A 4 4
2 2015-01-01 01:00:00 A 4 2
3 2015-01-01 00:00:00 B 1 1
You can use the order function to rearrange the table or even the sort.POSIXlt function:
m=aggregate(.~Time+Station,dat,sum)
m[order(m[,1]),]
Time Station count_starts count_ends
1 2015-01-01 00:00:00 A 4 4
3 2015-01-01 00:00:00 B 1 1
2 2015-01-01 01:00:00 A 4 2
A solution using dplyr and lubridate. The key is to use ceiling_date to round the date-time column up to an hourly time step, and then group and summarize the data.
library(dplyr)
library(lubridate)
dt2 <- dt %>%
mutate(Time = mdy_hm(Time)) %>%
mutate(Time = ceiling_date(Time, unit = "hour")) %>%
group_by(Time, Station) %>%
summarise(`sum(count_starts)` = sum(count_starts),
`sum(count_ends)` = sum(count_ends)) %>%
ungroup()
dt2
# # A tibble: 3 x 4
# Time Station `sum(count_starts)` `sum(count_ends)`
# <dttm> <chr> <int> <int>
# 1 2015-01-01 01:00:00 A 4 4
# 2 2015-01-01 01:00:00 B 1 1
# 3 2015-01-01 02:00:00 A 4 2
DATA
dt <- read.table(text = "Time Station count_starts count_ends
'01/01/2015 00:30' A 2 3
'01/01/2015 00:40' A 2 1
'01/01/2015 00:55' B 1 1
'01/01/2015 01:17' A 3 1
'01/01/2015 01:37' A 1 1",
header = TRUE, stringsAsFactors = FALSE)
Explanation
mdy_hm converts the string to a date-time; the name means "month-day-year hour-minute", matching the structure of the string. ceiling_date rounds a date-time object up to the unit specified. group_by groups the data, and summarise performs the summary operation.
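For instance, on a single value from the sample:
library(lubridate)
ceiling_date(mdy_hm("01/01/2015 00:30"), unit = "hour")
# [1] "2015-01-01 01:00:00 UTC"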
There are basically two things required:
1) Round the Time down to its 1-hour window:
library(data.table)
library(lubridate)
data = data.table(
  Time = c('01/01/2015 00:30', '01/01/2015 00:40', '01/01/2015 00:55',
           '01/01/2015 01:17', '01/01/2015 01:37'),
  Station = c('A', 'A', 'B', 'A', 'A'),
  count_starts = c(2, 2, 1, 3, 1),
  count_ends = c(3, 1, 1, 1, 1))
# parse with month/day/year to match the sample data
data[, Time_conv := as.POSIXct(strptime(Time, '%m/%d/%Y %H:%M'))]
data[, Time_round := floor_date(Time_conv, unit = "1 hour")]
2) Aggregate the data table obtained above, by hour and station, to get the desired result:
New_data = data[, list(count_starts_sum = sum(count_starts),
                       count_ends_sum = sum(count_ends)),
                by = c('Time_round', 'Station')]
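Note that floor_date labels each hour by its start, while the expected output in the question labels the 00:xx rows as 01:00; if that convention is wanted, ceiling_date can be swapped in:
# alternative: label each observation by the end of its hour, as in the question
data[, Time_round := ceiling_date(Time_conv, unit = "1 hour")]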
I have this csv data
Date Kilometer
2015-01-01 15:56:00 1
2015-01-01 17:40:00 2
2015-01-02 14:38:00 4
2015-01-02 14:45:00 3
And I would like to group by date and sum Kilometer like this:
Date Kilometer
2015-01-01 3
2015-01-02 7
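Both answers below assume Date is already a date-time column; if it comes out of the csv as character, it needs parsing first (format assumed from the sample shown):
# parse the csv's character timestamps into POSIXct, format assumed from the sample
df$Date <- as.POSIXct(df$Date, format = "%Y-%m-%d %H:%M:%S")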
We can use data.table
library(data.table)
library(lubridate)
setDT(df)[, .(Kilometer = sum(Kilometer)) , .(Date=date(Date))]
This can be done using dplyr and lubridate
library(dplyr)
df %>% group_by(Date = lubridate::date(Date)) %>% summarise(Kilometer=sum(Kilometer))
Date Kilometer
(date) (int)
1 2015-01-01 3
2 2015-01-02 7