I'm trying to populate "FinalDate" based on "ExpectedDate" and "ObservedDate".
The rule is: within each group, if ObservedDate falls after the previous ExpectedDate and before the next ExpectedDate, then FinalDate equals ObservedDate; otherwise FinalDate equals ExpectedDate.
How can I modify the code below to make sure that:
FinalDate is filled in by Group
Iteration numbers don't skip any rows
library(dplyr)

set.seed(2)
dat <- data.frame(Group = sample(LETTERS[1:10], 100, replace = TRUE),
                  Date = sample(seq(as.Date('2013/01/01'), as.Date('2020/01/01'), by = "day"), 100)) %>%
  mutate(ExpectedDate = Date + sample(10:200, 100, replace = TRUE),
         ObservedDate = Date + sample(10:200, 100, replace = TRUE)) %>%
  group_by(Group) %>%
  arrange(Date) %>%
  mutate(n = row_number()) %>%
  arrange(Group) %>%
  ungroup() %>%
  as.data.frame()
#generate some missing values in "ObservedDate"
dat[sample(nrow(dat),20), "ObservedDate"]<-NA
dat$FinalDate <- NA
for (i in 1:nrow(dat)) {
  dat[i, "FinalDate"] <- if_else(!is.na(dat$ObservedDate[i]) &&
                                   dat[i, "ObservedDate"] > dat[i - 1, "ExpectedDate"] &&
                                   dat[i, "ObservedDate"] < dat[i + 1, "ExpectedDate"],
                                 dat[i, "ObservedDate"],
                                 dat[i, "ExpectedDate"])
}
dat$FinalDate <- as.Date(dat$FinalDate, origin = "1970-01-01") # convert numeric back to Date
For example, in the output below:
At i = 90, the code looks for the previous ExpectedDate within letter I;
we want it to look for ExpectedDate only within letter J. If there is no previous ExpectedDate for a group and ObservedDate is greater than ExpectedDate but less than the next ExpectedDate, then FinalDate should be filled with ExpectedDate.
At i = 100, the code generates NA because there is no next observation available;
we want this value to be filled in such that, for the last observation in each group, FinalDate = ObservedDate if ObservedDate is greater than this last ExpectedDate within the group, else ExpectedDate.
Group Date ExpectedDate ObservedDate n FinalDate
88 I 2015-09-07 2015-12-05 <NA> 7 2015-12-05
89 I 2018-08-02 2018-11-01 2018-08-13 8 2018-11-01
90 J 2013-07-24 2013-08-30 2013-08-12 1 2013-08-30
91 J 2013-11-22 2014-01-02 2014-04-05 2 2014-04-05
92 J 2014-11-03 2015-03-23 2015-05-10 3 2015-05-10
93 J 2015-08-30 2015-12-09 2016-02-04 4 2016-02-04
94 J 2016-04-18 2016-09-03 <NA> 5 2016-09-03
95 J 2016-10-10 2017-01-29 2017-04-14 6 2017-04-14
96 J 2017-02-14 2017-07-05 <NA> 7 2017-07-05
97 J 2017-04-21 2017-10-01 2017-08-26 8 2017-08-26
98 J 2017-10-01 2018-01-27 2018-02-28 9 2018-02-28
99 J 2018-08-03 2019-01-31 2018-10-20 10 2018-10-20
100 J 2019-04-25 2019-06-23 2019-08-16 11 <NA>
We can drop the for loop and use group_by(), lag() and lead() from dplyr:
library(dplyr)
dat %>%
group_by(Group) %>%
mutate(FinalDate = if_else(ObservedDate > lag(ExpectedDate) &
ObservedDate < lead(ExpectedDate), ObservedDate, ExpectedDate))
We can also do this with data.table::between():
dat %>%
group_by(Group) %>%
mutate(FinalDate = if_else(data.table::between(ObservedDate,
lag(ExpectedDate), lead(ExpectedDate)), ObservedDate, ExpectedDate))
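Note that neither one-liner covers the OP's boundary rules (the first and last row of each group) on its own, because lag() and lead() return NA there and if_else() then yields NA. A sketch that spells the rules out with case_when(), using hypothetical helper columns prev_exp and next_exp:

library(dplyr)
dat %>%
  group_by(Group) %>%
  mutate(prev_exp = lag(ExpectedDate),
         next_exp = lead(ExpectedDate),
         FinalDate = case_when(
           is.na(ObservedDate) ~ ExpectedDate,                           # no observation
           is.na(prev_exp) ~ ExpectedDate,                               # first row of group
           is.na(next_exp) & ObservedDate > ExpectedDate ~ ObservedDate, # last row of group
           is.na(next_exp) ~ ExpectedDate,
           ObservedDate > prev_exp & ObservedDate < next_exp ~ ObservedDate,
           TRUE ~ ExpectedDate)) %>%
  ungroup() %>%
  select(-prev_exp, -next_exp)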
My code executes just fine, but it takes an enormous amount of time to finish. I would like some help optimizing it and, if possible, a way to execute a rolling aggregation on multiple columns.
I've been trying a few other ways by creating a function and vectorizing my data frame with library(data.table), but with no success: I actually get half of what I should get, and I can only do one column at a time.
# Creating functions
fun <- function(x, date, thresh) {
  D <- as.matrix(dist(date)) # distance matrix between dates
  D <- D <= thresh
  D[lower.tri(D)] <- FALSE   # don't sum to future
  R <- D * x                 # FALSE is treated as 0
  # note: colMeans() divides by the total number of rows, not by the number of
  # dates inside each window, which may explain the too-small results
  colMeans(R, na.rm = TRUE)
}
setDT(df_2)
df_2[, invoiceDate := as.Date(invoiceDate, format = "%m/%d/%Y")]
setkey(df_2, cod_unb, cod_pdv, invoiceDate)
df_2[, volume_total_diario_RT30 := fun(volume_total_diario, invoiceDate, 30), by = list(cod_unb, cod_pdv)]
This is my current code, which works but takes too much time (over 8 hours to process 30 days):
years <- 2017:2019
months <- 1:12
days <- 1:31
# keep one placeholder row so rbind() has matching columns; it is removed at the end
df_final <- df_n[1, c('cod_unb','cod_pdv','cpf_cnpj','idade_pdv_meses','status_telefone',
                      'col1','col2','col3','year','month','day')]
for (i in years) {
  for (j in months) {
    for (k in days) {
      if (j == 1) {
        df_temp <- df_n[(df_n$years == i & df_n$months == j & df_n$days <= k) |
                        (df_n$years == (i - 1) & df_n$months == 12 & df_n$days >= k), ]
      }
      if (j != 1) {
        df_temp <- df_n[(df_n$years == i & df_n$months == j & df_n$days <= k) |
                        (df_n$years == i & df_n$months == (j - 1) & df_n$days >= k), ]
      }
      # Aggregate.
      if (nrow(df_temp) >= 1) {
        df_temp <- aggregate(df_temp[, c('col1','col2','col3')],
                             by = list(df_temp$cod_unb, df_temp$cod_pdv, df_temp$cpf_cnpj,
                                       df_temp$idade_pdv_meses, df_temp$status_telefone),
                             FUN = mean)
        names(df_temp)[names(df_temp) == "Group.1"] <- "cod_unb"
        names(df_temp)[names(df_temp) == "Group.2"] <- "cod_pdv"
        names(df_temp)[names(df_temp) == "Group.3"] <- "cpf_cnpj"
        names(df_temp)[names(df_temp) == "Group.4"] <- "idade_pdv_meses"
        names(df_temp)[names(df_temp) == "Group.5"] <- "status_telefone"
        df_temp$years <- i
        df_temp$months <- j
        df_temp$days <- k
        df_final <- rbind(df_final, df_temp)
      }
    }
  }
}
df_final <- df_final[-1, ] # drop the placeholder row
The output should be the column R30:
cod_unb;cod_pdv;Years;Months;Days;date;volume_total_diario;R30
111;1005;2018;11;3;03/11/2018;0.48;
111;1005;2018;11;9;09/11/2018;0.79035;
111;1005;2018;11;16;16/11/2018;1.32105;
111;1005;2018;11;24;24/11/2018;0.6414;
111;1005;2018;11;30;30/11/2018;0.6;
111;1005;2018;12;7;07/12/2018;1.79175;1.02891
111;1005;2018;12;15;15/12/2018;1.4421;1.15926
111;1005;2018;12;21;21/12/2018;0.48;0.99105
111;1005;2018;12;28;28/12/2018;0.5535;0.97347
111;1005;2019;1;4;04/01/2019;0.36;0.92547
If I understand correctly, the OP has requested to aggregate values over a rolling period of 30 days and to append these aggregates to the original data.
This can be solved efficiently by aggregating in a non-equi join.
Here is an example for one variable using sample data provided by the OP:
library(data.table)
# coerce to data.table, coerce character date to class IDate
setDT(df_n)[, date := as.IDate(date, "%d/%m/%Y")]
# intermediate result for demonstration:
df_n[.(upper = date, lower = date - 30), on = .(date <= upper, date >= lower),
mean(volume_total_diario), by = .EACHI]
date date V1
1: 2018-11-03 2018-10-04 0.480000
2: 2018-11-09 2018-10-10 0.635175
3: 2018-11-16 2018-10-17 0.863800
4: 2018-11-24 2018-10-25 0.808200
5: 2018-11-30 2018-10-31 0.766560
6: 2018-12-07 2018-11-07 1.028910
7: 2018-12-15 2018-11-15 1.159260
8: 2018-12-21 2018-11-21 0.991050
9: 2018-12-28 2018-11-28 0.973470
10: 2019-01-04 2018-12-05 0.925470
The intermediate result shows the upper and lower limits of the date range included in the aggregation and the aggregated values for the respective periods. This can be used to add a new column to df_n:
# update df_n by appending new column
setDT(df_n)[, R30_new := df_n[.(upper = date, lower = date - 30), on = .(date <= upper, date >= lower),
mean(volume_total_diario), by = .EACHI]$V1]
df_n
cod_unb cod_pdv Years Months Days date volume_total_diario R30 R30_new
1: 111 1005 2018 11 3 2018-11-03 0.48000 NA 0.480000
2: 111 1005 2018 11 9 2018-11-09 0.79035 NA 0.635175
3: 111 1005 2018 11 16 2018-11-16 1.32105 NA 0.863800
4: 111 1005 2018 11 24 2018-11-24 0.64140 NA 0.808200
5: 111 1005 2018 11 30 2018-11-30 0.60000 NA 0.766560
6: 111 1005 2018 12 7 2018-12-07 1.79175 1.02891 1.028910
7: 111 1005 2018 12 15 2018-12-15 1.44210 1.15926 1.159260
8: 111 1005 2018 12 21 2018-12-21 0.48000 0.99105 0.991050
9: 111 1005 2018 12 28 2018-12-28 0.55350 0.97347 0.973470
10: 111 1005 2019 1 4 2019-01-04 0.36000 0.92547 0.925470
The values of R30 and R30_new are identical; R30_new also contains results for the first 5 rows.
Caveat
Additional grouping variables have been ignored for the sake of clarity but can be included easily; a sketch follows below. Also, the solution can be extended to aggregate multiple value columns.
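For illustration, a sketch of how grouping columns could enter the non-equi join, assuming the real data carries cod_unb and cod_pdv as in the OP's code (untested against the full data set):

# build the lookup table with the group columns plus the date window, then
# join on equality of the groups and on the date range
df_n[, R30_new := df_n[df_n[, .(cod_unb, cod_pdv, upper = date, lower = date - 30)],
                       on = .(cod_unb, cod_pdv, date <= upper, date >= lower),
                       mean(volume_total_diario), by = .EACHI]$V1]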
Data
library(data.table)
df_n <- fread("
cod_unb;cod_pdv;Years;Months;Days;date;volume_total_diario;R30
111;1005;2018;11;3;03/11/2018;0.48;
111;1005;2018;11;9;09/11/2018;0.79035;
111;1005;2018;11;16;16/11/2018;1.32105;
111;1005;2018;11;24;24/11/2018;0.6414;
111;1005;2018;11;30;30/11/2018;0.6;
111;1005;2018;12;7;07/12/2018;1.79175;1.02891
111;1005;2018;12;15;15/12/2018;1.4421;1.15926
111;1005;2018;12;21;21/12/2018;0.48;0.99105
111;1005;2018;12;28;28/12/2018;0.5535;0.97347
111;1005;2019;1;4;04/01/2019;0.36;0.92547
")
EDIT: Aggregating multiple variables
As the OP has asked for a way to execute a rolling aggregation on multiple columns, here is an example.
First, we need to create an additional value var in OP's sample dataset:
df_n <- fread("
cod_unb;cod_pdv;Years;Months;Days;date;volume_total_diario;R30
111;1005;2018;11;3;03/11/2018;0.48;
111;1005;2018;11;9;09/11/2018;0.79035;
111;1005;2018;11;16;16/11/2018;1.32105;
111;1005;2018;11;24;24/11/2018;0.6414;
111;1005;2018;11;30;30/11/2018;0.6;
111;1005;2018;12;7;07/12/2018;1.79175;1.02891
111;1005;2018;12;15;15/12/2018;1.4421;1.15926
111;1005;2018;12;21;21/12/2018;0.48;0.99105
111;1005;2018;12;28;28/12/2018;0.5535;0.97347
111;1005;2019;1;4;04/01/2019;0.36;0.92547
")[
, date := as.IDate(date, "%d/%m/%Y")][, var2 := .I][]
df_n
cod_unb cod_pdv Years Months Days date volume_total_diario R30 var2
1: 111 1005 2018 11 3 2018-11-03 0.48000 NA 1
2: 111 1005 2018 11 9 2018-11-09 0.79035 NA 2
3: 111 1005 2018 11 16 2018-11-16 1.32105 NA 3
4: 111 1005 2018 11 24 2018-11-24 0.64140 NA 4
5: 111 1005 2018 11 30 2018-11-30 0.60000 NA 5
6: 111 1005 2018 12 7 2018-12-07 1.79175 1.02891 6
7: 111 1005 2018 12 15 2018-12-15 1.44210 1.15926 7
8: 111 1005 2018 12 21 2018-12-21 0.48000 0.99105 8
9: 111 1005 2018 12 28 2018-12-28 0.55350 0.97347 9
10: 111 1005 2019 1 4 2019-01-04 0.36000 0.92547 10
So, a column var2 has been added (which simply contains the row number).
This is the code to aggregate multiple columns using the same aggregation function:
cols <- c("volume_total_diario", "var2")
setDT(df_n)[, paste0("mean_", cols) :=
df_n[.(upper = date, lower = date - 30),
on = .(date <= upper, date >= lower),
lapply(.SD, mean),
.SDcols = cols, by = .EACHI][
, .SD, .SDcols = cols]][]
df_n
cod_unb cod_pdv Years Months Days date volume_total_diario R30 var2 mean_volume_total_diario mean_var2
1: 111 1005 2018 11 3 2018-11-03 0.48000 NA 1 0.480000 1.0
2: 111 1005 2018 11 9 2018-11-09 0.79035 NA 2 0.635175 1.5
3: 111 1005 2018 11 16 2018-11-16 1.32105 NA 3 0.863800 2.0
4: 111 1005 2018 11 24 2018-11-24 0.64140 NA 4 0.808200 2.5
5: 111 1005 2018 11 30 2018-11-30 0.60000 NA 5 0.766560 3.0
6: 111 1005 2018 12 7 2018-12-07 1.79175 1.02891 6 1.028910 4.0
7: 111 1005 2018 12 15 2018-12-15 1.44210 1.15926 7 1.159260 5.0
8: 111 1005 2018 12 21 2018-12-21 0.48000 0.99105 8 0.991050 6.0
9: 111 1005 2018 12 28 2018-12-28 0.55350 0.97347 9 0.973470 7.0
10: 111 1005 2019 1 4 2019-01-04 0.36000 0.92547 10 0.925470 8.0
Note that the new columns have been named programmatically.
I have a data frame of dates and times. I've included the pw_backup column as an example of 13 other columns that I have. I computed the time differences between the two columns to create two more columns in the same data frame, called dur_days and dur_hour.
first_pmt_date pw_backup
<dttm> <dttm>
1 2016-04-12 18:57:00 2016-04-12 18:44:00
2 2016-05-02 17:06:00 2016-05-02 16:41:00
3 2016-04-06 08:35:00 2016-04-06 08:33:00
4 2016-03-15 22:38:00 2016-03-15 22:12:00
5 2016-04-15 14:36:00 2016-04-15 14:30:00
6 2016-03-22 16:51:00 2016-03-22 16:43:00
7 2016-03-25 07:52:00 2016-05-31 07:40:00
8 2016-04-11 12:39:00 2016-04-11 12:22:00
9 2016-03-08 13:13:00 2016-03-08 09:50:00
10 2016-02-28 13:43:00 2016-05-08 15:44:00
My code gives me the output that I want. I am having trouble turning it into a function, and eventually a for loop over all of the columns, so that I can pass any column as x and get the same output.
My Current Code:
paywall_full %>%
  filter(paid == 1 & !is.na(pw_backup)) %>%
  mutate(dur_days = round(difftime(first_pmt_date, pw_backup, units = 'days'), 0),
         dur_hour = difftime(first_pmt_date, pw_backup)) %>%
  select(first_pmt_date, pw_backup, dur_days, dur_hour) %>%
  summarise(same_day_conv = sum(dur_days == 0) / count_it$pw_backup,
            same_hour_conv = sum(dur_hour <= 60 & dur_hour >= 0) / count_it$pw_backup)
This is the code that I imagined would work, replacing the concrete column with x, so that I could pass any other column into my function and get the same output:
conv_rate <- function(x)
  paywall_full %>%
    filter(paid == 1 & !is.na(x)) %>%
    mutate(dur_days = round(difftime(first_pmt_date, x, units = 'days'), 0),
           dur_hour = difftime(first_pmt_date, x)) %>%
    select(first_pmt_date, x, dur_days, dur_hour) %>%
    summarise(same_day_conv = sum(dur_days == 0) / count_it$pw_backup,
              same_hour_conv = sum(dur_hour <= 60 & dur_hour >= 0) / count_it$pw_backup)
I understand why it doesn't work: if I define a variable beforehand,
x <- paywall_full$pw_backup
this overwrites the piped data every time it passes to another function. I hope my question is clear.
BONUS: Turning this into a loop through my columns and assigning to a data frame.
Thanks in advance!
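One way to make this work is to pass the column name as a string and use dplyr's .data pronoun inside the verbs. A sketch, assuming count_it$pw_backup is the denominator the OP already computes elsewhere (passed in as an argument here), which also makes the bonus loop straightforward:

library(dplyr)
conv_rate <- function(data, col, denom) {
  data %>%
    filter(paid == 1 & !is.na(.data[[col]])) %>%
    mutate(dur_days = round(difftime(first_pmt_date, .data[[col]], units = "days"), 0),
           # pin the units so the <= 60 check below really means minutes
           dur_hour = difftime(first_pmt_date, .data[[col]], units = "mins")) %>%
    summarise(same_day_conv = sum(dur_days == 0) / denom,
              same_hour_conv = sum(dur_hour >= 0 & dur_hour <= 60) / denom)
}

# bonus: loop over several columns and bind the results into one data frame
cols <- c("pw_backup")  # hypothetical: list the other 13 column names here
bind_rows(setNames(lapply(cols, conv_rate, data = paywall_full,
                          denom = count_it$pw_backup), cols),
          .id = "column")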
Date Sales
3/11/2017 1
3/12/2017 0
3/13/2017 40
3/14/2017 47
3/15/2017 83
3/16/2017 62
3/17/2017 13
3/18/2017 58
3/19/2017 27
3/20/2017 17
3/21/2017 71
3/22/2017 76
3/23/2017 8
3/24/2017 13
3/25/2017 97
3/26/2017 58
3/27/2017 80
3/28/2017 77
3/29/2017 31
3/30/2017 78
3/31/2017 0
4/1/2017 40
4/2/2017 58
4/3/2017 32
4/4/2017 31
4/5/2017 90
4/6/2017 35
4/7/2017 88
4/8/2017 16
4/9/2017 72
4/10/2017 39
4/11/2017 8
4/12/2017 88
4/13/2017 93
4/14/2017 57
4/15/2017 23
4/16/2017 15
4/17/2017 6
4/18/2017 91
4/19/2017 87
4/20/2017 44
Here the current date is 20/04/2017. My question: how can I group the data from 11/03/2017 back to 19/04/2017 into 4 equal parts, with the sales summed per part, in R?
E.g.:
library("xts")
ep <- endpoints(data, on = 'days', k = 4)
period.apply(data,ep,sum)
It's not working: it takes the start date to the current date, but I need the data gathered from yesterday (19/4/2017) back to the start date and split into 4 equal parts.
Can anyone guide me?
Thank you
Base R has a function, cut.Date(), which is built for this purpose.
However, the question is not fully clear on what the OP intends. My understanding of the requirements supplied in Q and additional comment is:
Take the sales data per day in Book1 but leave out the current day, i.e., use only completed days.
Group the data in four equal parts, i.e., four periods containing an equal number of days. (Note that the title of the Q and the attempt to use xts::endpoints() with k = 4 indicate that the OP might instead intend to group the data into periods of four days each.)
Summarize the sales figures by period.
For the sake of brevity, data.table is used here for data manipulation and aggregation, and lubridate for date manipulation.
library(data.table)
library(lubridate)
# coerce to data.table, convert Date column from character to class Date,
# exclude the actual date
temp <- setDT(Book1)[, Date := mdy(Book1$Date)][Date != today()]
# cut the date range in four parts
temp[, start_date_of_period := cut.Date(Date, 4)]
temp
# Date Sales start_date_of_period
# 1: 2017-03-11 1 2017-03-11
# 2: 2017-03-12 0 2017-03-11
# 3: 2017-03-13 40 2017-03-11
# ...
#38: 2017-04-17 6 2017-04-10
#39: 2017-04-18 91 2017-04-10
#40: 2017-04-19 87 2017-04-10
# Date Sales start_date_of_period
# aggregate sales by period
temp[, .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]
# start_date_of_period n_days total_sales
#1: 2017-03-11 10 348
#2: 2017-03-21 10 589
#3: 2017-03-31 10 462
#4: 2017-04-10 10 507
Thanks to chaining, this can be put together in one statement without using a temporary variable:
setDT(Book1)[, Date := mdy(Book1$Date)][Date != today()][
, start_date_of_period := cut.Date(Date, 4)][
, .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]
Note: if you want to reproduce the result in the future, you will have to replace the call to today(), which excludes the current day, with mdy("4/20/2017"), the last day in the sample data set supplied by the OP.
I would like to create a column of 0s and 1s based on inequalities of three columns of dates.
The idea is the following: if event_date is before death_date or study_over, then the column event should be 1; if event_date occurs after death_date or study_over, event should be 0. Both event_date and death_date may contain NAs.
set.seed(1337)
rand_dates <- Sys.Date() - 365:1
df <-
data.frame(
event_date = sample(rand_dates, 20),
death_date = sample(rand_dates, 20),
study_over = sample(rand_dates, 20)
)
My attempt was the following:
eventR <- function(x, y, z) {
  if (is.na(y)) {
    ifelse(x <= z, 1, 0)
  } else if (y <= z) {
    ifelse(x < y, 1, 0)
  } else {
    ifelse(x <= z, 1, 0)
  }
}
I use it in the following manner
library(dplyr)
df[c(3, 5, 7), "event_date"] <- NA #there are some NA in .$event_date
df[c(3, 4, 6), "death_date"] <- NA #there are some NA in .$death_date
df %>%
mutate(event = sapply(.$event_date, eventR, y = .$death_date, z = .$study_over))
##Error: wrong result size (400), expected 20 or 1
##In addition: There were 40 warnings (use warnings() to see them)
I can't figure out how to do this. Any suggestions?
This would seem to construct a binary column (with NAs where needed) in which 1 indicates "event_date is before death_date or study_over" and 0 is used elsewhere. As already pointed out, your specification does not cover all cases:
df$event <- with(df, as.numeric( event_date < pmax( death_date , study_over) ) )
df
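One detail worth noting: without na.rm, pmax() propagates NA, so rows with a missing death_date get event = NA. If the intention is to fall back to study_over in that case (the is.na(y) branch of eventR), a variant sketch:

# na.rm = TRUE makes pmax() ignore a missing death_date and use study_over alone
df$event <- with(df, as.numeric(event_date < pmax(death_date, study_over, na.rm = TRUE)))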
You can use pmap_dbl() from the purrr package instead of sapply:
library(dplyr)
library(purrr)
df %>% mutate(event = pmap_dbl(list(event_date, death_date, study_over), eventR))
event_date death_date study_over event
1 2016-10-20 2017-01-27 2016-12-16 1
2 2016-10-15 2016-12-12 2017-01-20 1
3 <NA> <NA> 2016-10-09 NA
4 2016-09-04 <NA> 2016-11-17 1
5 <NA> 2016-10-13 2016-06-09 NA
6 2016-07-21 <NA> 2016-04-26 0
7 <NA> 2017-02-21 2016-07-12 NA
8 2016-07-02 2017-02-08 2016-08-24 1
9 2016-06-19 2016-09-07 2016-04-11 0
10 2016-05-14 2017-03-13 2016-08-03 1
11 2017-03-06 2017-02-05 2017-02-28 0
12 2017-03-10 2016-04-28 2016-11-30 0
13 2017-01-10 2016-12-10 2016-10-27 0
14 2016-05-31 2016-06-12 2016-08-13 1
15 2017-03-03 2016-12-25 2016-12-20 0
16 2016-04-01 2016-11-03 2016-06-30 1
17 2017-02-26 2017-02-25 2016-05-12 0
18 2017-02-08 2016-12-08 2016-10-14 0
19 2016-07-19 2016-07-03 2016-09-22 0
20 2016-06-17 2016-06-06 2016-11-09 0
You might also be interested in the dplyr function case_when(), which handles many if/else branches.
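For illustration, a sketch of eventR's logic written with case_when(), vectorised over the whole data frame so no row-wise mapping is needed:

library(dplyr)
df %>%
  mutate(event = case_when(
    is.na(event_date)                                  ~ NA_real_,
    is.na(death_date) & event_date <= study_over       ~ 1,
    is.na(death_date)                                  ~ 0,
    death_date <= study_over & event_date < death_date ~ 1,
    death_date <= study_over                           ~ 0,
    event_date <= study_over                           ~ 1,
    TRUE                                               ~ 0))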
Consider this
library(lubridate)

time <- seq(ymd_hms("2014-02-24 23:00:00"), ymd_hms("2014-06-25 08:32:00"), by = "hour")
group <- rep(LETTERS[1:20], each = length(time))
value <- sample(-10^3:10^3, length(time), replace = TRUE)
df2 <- data.frame(time, group, value)
str(df2)
> head(df2)
time group value
1 2014-02-24 23:00:00 A 246
2 2014-02-25 00:00:00 A -261
3 2014-02-25 01:00:00 A 628
4 2014-02-25 02:00:00 A 429
5 2014-02-25 03:00:00 A -49
6 2014-02-25 04:00:00 A -749
I would like to create a variable that contains, for each group, the rolling mean of value
over the last 5 days (not including the current observation)
only considering observations that fall at the exact same hour as the current observation.
In other words:
At time 2014-02-24 23:00:00, df2['rolling_mean_same_hour'] contains the mean of the values of value observed at 23:00:00 during the last 5 days in the data (not including 2014-02-24 of course).
I would like to do this in either dplyr or data.table, but I confess I have no idea how.
Any ideas?
Many thanks!
You can calculate the rollmean() with your data grouped by the group variable and the hour of the time variable. Normally rollmean() includes the current observation, but you can use the shift() function to exclude it from the rolling mean:
library(data.table); library(zoo)
setDT(df2)
df2[, .(rolling_mean_same_hour = shift(
rollmean(value, 5, na.pad = TRUE, align = 'right'),
n = 1,
type = 'lag'),
time), .(hour(time), group)]
# hour group rolling_mean_same_hour time
# 1: 23 A NA 2014-02-24 23:00:00
# 2: 23 A NA 2014-02-25 23:00:00
# 3: 23 A NA 2014-02-26 23:00:00
# 4: 23 A NA 2014-02-27 23:00:00
# 5: 23 A NA 2014-02-28 23:00:00
# ---
#57796: 22 T -267.0 2014-06-20 22:00:00
#57797: 22 T -389.6 2014-06-21 22:00:00
#57798: 22 T -311.6 2014-06-22 22:00:00
#57799: 22 T -260.0 2014-06-23 22:00:00
#57800: 22 T -26.8 2014-06-24 22:00:00
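Since the OP asked for dplyr or data.table, here is a sketch of a dplyr equivalent under the same assumptions (one observation per group and hour each day; zoo::rollmeanr() is the right-aligned rolling mean):

library(dplyr); library(lubridate); library(zoo)
df2 %>%
  group_by(group, hour = hour(time)) %>%
  arrange(time, .by_group = TRUE) %>%
  # mean of the previous 5 same-hour values, excluding the current one via lag()
  mutate(rolling_mean_same_hour = lag(rollmeanr(value, 5, fill = NA))) %>%
  ungroup()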