I would like to calculate mean every 5 rows in my df. Here is my df :
Time
value
03/06/2021 06:15:00
NA
03/06/2021 06:16:00
NA
03/06/2021 06:17:00
20
03/06/2021 06:18:00
22
03/06/2021 06:19:00
25
03/06/2021 06:20:00
NA
03/06/2021 06:21:00
31
03/06/2021 06:22:00
23
03/06/2021 06:23:00
19
03/06/2021 06:24:00
25
03/06/2021 06:25:00
34
03/06/2021 06:26:00
42
03/06/2021 06:27:00
NA
03/06/2021 06:28:00
19
03/06/2021 06:29:00
17
03/06/2021 06:30:00
25
I already have a loop which goes well to calculate means for each 5 rows package. My problem is in my "mean function".
The problem is :
-if I put na.rm = FALSE, mean = NA as soon as there is a NA in a package of 5 values.
- if I put na.rm = TRUE in mean function, the result gives me averages that are shifted to take 5 values. I would like the NA not to interfere with the average and that when there is a NA in a package of 5 values, the average is only done on 4 values.
How can I do this? Thanks for your help !
You can solve your problem by introducing a dummy variable that groups your observarions in sets of five and then calculating the mean within group. Here's MWE, based in the tidyverse, that assumes your data is in a data.frame named df.
library(tidyverse)
df %>%
mutate(Group= 1 + floor((row_number()-1) / 5)) %>%
group_by(Group) %>%
summarise(Mean=mean(value, na.rm=TRUE), .groups="drop")
# A tibble: 4 × 2
Group Mean
<dbl> <dbl>
1 1 22.3
2 2 24.5
3 3 28
4 4 25
A solution based on purrr::map_dfr:
library(purrr)
df <- data.frame(
stringsAsFactors = FALSE,
time = c("03/06/2021 06:15:00","03/06/2021 06:16:00",
"03/06/2021 06:17:00",
"03/06/2021 06:18:00","03/06/2021 06:19:00",
"03/06/2021 06:20:00","03/06/2021 06:21:00",
"03/06/2021 06:22:00","03/06/2021 06:23:00",
"03/06/2021 06:24:00","03/06/2021 06:25:00",
"03/06/2021 06:26:00",
"03/06/2021 06:27:00","03/06/2021 06:28:00",
"03/06/2021 06:29:00","03/06/2021 06:30:00"),
value = c(NA,NA,20L,22L,
25L,NA,31L,23L,19L,25L,34L,42L,NA,19L,17L,
25L)
)
map_dfr(1:(nrow(df)-5),
~ data.frame(Group =.x, Mean = mean(df$value[.x:(.x+5)],na.rm=T)))
#> Group Mean
#> 1 1 22.33333
#> 2 2 24.50000
#> 3 3 24.20000
#> 4 4 24.00000
#> 5 5 24.60000
#> 6 6 26.40000
#> 7 7 29.00000
#> 8 8 28.60000
#> 9 9 27.80000
#> 10 10 27.40000
#> 11 11 27.40000
If you want to take average of every 5 minutes you may use lubridate's function floor_date/ceiling_date to round the time.
library(dplyr)
library(lubridate)
df %>%
mutate(time = mdy_hms(time),
time = floor_date(time, '5 mins')) %>%
group_by(time) %>%
summarise(value = mean(value, na.rm = TRUE))
# time value
# <dttm> <dbl>
#1 2021-03-06 06:15:00 22.3
#2 2021-03-06 06:20:00 24.5
#3 2021-03-06 06:25:00 28
#4 2021-03-06 06:30:00 25
Related
I need to calculate the rolling 14 day average for a large data set. The data set is private, although I can share a small snippet.
The data set comes from an instrument in the field which does not operate every day. For instance, a snippet of the data frame would look like so:
Date, Value
2022-01-28, 196.00000
2022-01-31, 104.00000
2022-02-01, 0.00000
2022-02-02, 98.00000
2022-02-03, 0.00000
2022-02-07, 139.92308
2022-02-08, 114.50000
2022-02-09, 121.64286
2022-02-10, 96.50000
2022-02-11, 151.63636
2022-02-14, 85.87500
2022-02-15, 98.90000
2022-02-18, 209.40000
2022-02-21, 172.18182
2022-02-22, 0.00000
2022-02-23, 0.00000
2022-02-28, 264.00000
2022-03-01, 131.75000
2022-03-03, 119.33333
2022-03-04, 88.80000
2022-03-07, 152.16667
2022-03-08, 24.50000
I have the following plot.
library(zoo)
library(tidyverse)
ggplot(data=df_days, aes(x=Date, y=Value)) +
geom_line(color="black", lwd=0.5) +
geom_point(lwd=0.5) +
geom_line(y=rollmean(df_days$Value, 14, na.pad=TRUE), color="red", lwd=0.8)
I realised that I'm actually taking the 14 point average, i.e the average of 14 data points. Is there a way to take the 14 day average, based upon the dates themselves?
1) Using the input from the question shown reproducibly in the Note at the end we calculate the number of points to use at each date, w, and then use rollapplyr with that.
library(zoo)
within(DF, {
w <- seq_along(Date) - findInterval(Date - 14, Date)
mean14 <- rollapplyr(Value, w, mean)
})
giving the following where mean14 is the mean and w is the number of points used to calculate that mean. This is calculated in such a way that if there were no missing dates then it would give the same result as rollapplyr(DF$Value, 14, mean, partial = TRUE) but if there are missing dates then it uses fewer based on the number of dates in a 14 day window. (Note that using different numbers of points for each mean can affect the variance.)
Date Value mean14 w
1 2022-01-28 196.0000 196.00000 1
2 2022-01-31 104.0000 150.00000 2
3 2022-02-01 0.0000 100.00000 3
4 2022-02-02 98.0000 99.50000 4
5 2022-02-03 0.0000 79.60000 5
6 2022-02-07 139.9231 89.65385 6
7 2022-02-08 114.5000 93.20330 7
8 2022-02-09 121.6429 96.75824 8
9 2022-02-10 96.5000 96.72955 9
10 2022-02-11 151.6364 91.80026 9
11 2022-02-14 85.8750 89.78637 9
12 2022-02-15 98.9000 100.77526 9
13 2022-02-18 209.4000 127.29716 8
14 2022-02-21 172.1818 131.32951 8
15 2022-02-22 0.0000 117.01700 8
16 2022-02-23 0.0000 101.81165 8
17 2022-02-28 264.0000 124.08030 6
18 2022-03-01 131.7500 129.55530 6
19 2022-03-03 119.3333 128.09502 7
20 2022-03-04 88.8000 110.86645 7
21 2022-03-07 152.1667 108.00714 7
22 2022-03-08 24.5000 111.50714 7
2) Another approach is to add the missing dates, fill in Value in those missing dates with NA and then use rollapplyr.
m <- merge(DF, data.frame(Date = seq(min(DF$Date), max(DF$Date), 1)), all = TRUE)
na.omit(transform(m,
mean14 = rollapplyr(Value, 14, mean, na.rm = TRUE, partial = TRUE)))
3) A variation of the above is to use zoo objects. Note that fortify.zoo(zz) can be used to create a data frame from a zoo object.
library(zoo)
z <- read.zoo(DF)
# 1
tt <- time(z)
w <- seq_along(tt) - findInterval(tt - 14, tt)
zz <- rollapplyr(z, w, mean)
# 2
m <- merge(z, zoo(, seq(start(z), end(z), 1)))
zz <- na.omit(rollapply(m, 14, mean, na.rm = TRUE))
Note
Lines <- "Date, Value
2022-01-28, 196.00000
2022-01-31, 104.00000
2022-02-01, 0.00000
2022-02-02, 98.00000
2022-02-03, 0.00000
2022-02-07, 139.92308
2022-02-08, 114.50000
2022-02-09, 121.64286
2022-02-10, 96.50000
2022-02-11, 151.63636
2022-02-14, 85.87500
2022-02-15, 98.90000
2022-02-18, 209.40000
2022-02-21, 172.18182
2022-02-22, 0.00000
2022-02-23, 0.00000
2022-02-28, 264.00000
2022-03-01, 131.75000
2022-03-03, 119.33333
2022-03-04, 88.80000
2022-03-07, 152.16667
2022-03-08, 24.50000"
DF <- read.csv(text = Lines)
DF$Date <- as.Date(DF$Date)
There may be more elegant solutions, but you can fill in the missing dates with NA:
df$Date <- as.Date(df$Date)
library(dplyr)
library(tidyr)
df %>% complete(Date = seq(min(Date),max(Date),1), fill = list(Value = NA))
Output:
# A tibble: 40 × 2
# Date Value
# <date> <dbl>
# 1 2022-01-28 196
# 2 2022-01-29 NA
# 3 2022-01-30 NA
# 4 2022-01-31 104
# 5 2022-02-01 0
# 6 2022-02-02 98
# ...
I have a dataframe that looks like this:
CYCLE date_cycle Randomization_Date COUPLEID
1 0 2016-02-16 10892
2 1 2016-08-17 2016-02-19 10894
3 1 2016-08-14 2016-02-26 10899
4 1 2016-02-26 10900
5 2 2016-03--- 2016-02-26 10900
6 3 2016-07-19 2016-02-26 10900
7 4 2016-11-15 2016-02-26 10900
8 1 2016-02-27 10901
9 2 2016-02--- 2016-02-27 10901
10 1 2016-03-27 2016-03-03 10902
11 2 2016-04-21 2016-03-03 10902
12 1 2016-03-03 10903
13 2 2016-03--- 2016-03-03 10903
14 0 2016-03-03 10904
15 1 2016-03-03 10905
16 2 2016-03-03 10905
17 3 2016-03-03 10905
18 4 2016-04-14 2016-03-03 10905
19 5 2016-05--- 2016-03-03 10905
20 6 2016-06--- 2016-03-03 10905
The goal is to fill in the missing day for a given ID using either an earlier or later date and add/subtract 28 from that.
The date_cycle variable was originally in the dataframe as a character type.
I have tried to code it as follows:
mutate(rowwise(df),
newdate = case_when( str_count(date1, pattern = "\\W") >2 ~ lag(as.Date.character(date1, "%Y-%m-%d"),1) + days(28)))
But I need to incorporate it by ID by CYCLE.
An example of my data could be made like this:
data.frame(stringsAsFactors = FALSE,
CYCLE =(0,1,1,1,2,3,4,1,2,1,2,1,2,0,1,2,3,4,5,6),
date_cycle = c(NA,"2016-08-17", "2016-08-14",NA,"2016-03---","2016-07-19", "2016-11-15",NA,"2016-02---", "2016-03-27","2016-04-21",NA, "2016-03---",NA,NA,NA,NA,"2016-04-14", "2016-05---","2016-06---"), Randomization_Date = c("2016-02-16","2016-02-19",
"2016-02-26","2016-02-26",
"2016-02-26","2016-02-26",
"2016-02-26",
"2016-02-27","2016-02-27",
"2016-03-03",
"2016-03-03","2016-03-03",
"2016-03-03","2016-03-03",
"2016-03-03",
"2016-03-03","2016-03-03",
"2016-03-03",
"2016-03-03","2016-03-03"),
COUPLEID = c(10892,10894,10899,10900,
10900,10900,10900,10901,10901,
10902,10902,10903,10903,10904,
10905,10905,10905,10905,10905,10905)
)
The output I am after would look like:
COUPLEID CYCLE date_cycle new_date_cycle
a 1 2014-03-27 2014-03-27
a 1 2014-04--- 2014-04-24
b 1 2014-03-24 2014-03-24
b 2 2014-04-21
b 3 2014-05--- 2014-05-19
c 1 2014-04--- 2014-04-02
c 2 2014-04-30 2014-04-30
I have also started to make a long conditional, but I wanted to ask here and see if anyone new of a more straight forward way to do it, instead of explicitly writing out all of the possible conditions.
mutate(rowwise(df),
newdate = case_when(
grp == 1 & str_count(date1, pattern = "\\W") >2 & !is.na(lead(date1,1) ~ lead(date1,1) - days(28),
grp == 2 & str_count(date1, pattern = "\\W") >2 & !is.na(lead(date1,1)) ~ lead(date1,1) - days(28),
grp == 3 & str_count(date1, pattern = "\\W") >2 & ...)))
Function to fill dates forward and backwards
filldates <- function(dates) {
m = which(!is.na(dates))
if(length(m)>0 & length(m)!=length(dates)) {
if(m[1]>1) for(i in seq(m,1,-1)) if(is.na(dates[i])) dates[i]=dates[i+1]-28
if(sum(is.na(dates))>0) for(i in seq_along(dates)) if(is.na(dates[i])) dates[i] = dates[i-1]+28
}
return(dates)
}
Usage:
data %>%
arrange(ID, grp) %>%
group_by(ID) %>%
mutate(date2=filldates(as.Date(date1,"%Y-%m-%d")))
Ouput:
ID grp date1 date2
<chr> <dbl> <chr> <date>
1 a 1 2014-03-27 2014-03-27
2 a 2 2014-04--- 2014-04-24
3 b 1 2014-03-24 2014-03-24
4 b 2 2014-04--- 2014-04-21
5 b 3 2014-05--- 2014-05-19
6 c 1 2014-03--- 2014-04-02
7 c 2 2014-04-30 2014-04-30
An option using purrr::accumulate().
library(tidyverse)
center <- df %>%
group_by(ID) %>%
mutate(helpDate = ymd(str_replace(date1, '---', '-01')),
refDate = max(ymd(date1), na.rm = T))
backward <- center %>%
filter(refDate == max(helpDate)) %>%
mutate(date2 = accumulate(refDate, ~ . - days(28), .dir = 'backward'))
forward <- center %>%
filter(refDate == min(helpDate)) %>%
mutate(date2 = accumulate(refDate, ~ . + days(28)))
bind_rows(forward, backward) %>%
ungroup() %>%
mutate(date2 = as_date(date2)) %>%
select(-c('helpDate', 'refDate'))
# # A tibble: 7 x 4
# ID grp date1 date2
# <chr> <int> <chr> <date>
# 1 a 1 2014-03-27 2014-03-27
# 2 a 2 2014-04--- 2014-04-24
# 3 b 1 2014-03-24 2014-03-24
# 4 b 2 2014-04--- 2014-04-21
# 5 b 3 2014-05--- 2014-05-19
# 6 c 1 2014-03--- 2014-04-02
# 7 c 2 2014-04-30 2014-04-30
As default I set the argument cut.points as NA and if it's on default then it shouldn't do anything with the data.
But if user decides to put for example cut.points = c("2012-01-01", "2013-01-01") then the data should be filtered by the column that has dates in it. And it should return only dates between 2012 to 2013.
The problem is that I'm reading data from the function so in theory i won't know what is the name of this date column that uses provides. So i find the column with dates and store it's name in the variable.
But the condition which i wrote that should filter based od this variable doesn't work:
modifier <- function(input.data, cut.points = c(NA, NA)) {
date_check <- sapply(input.data, function(x) !all(is.na(as.Date(as.character(x),format="%Y-%m-%d"))))
if (missing(cut.points)) {
input.data
} else {
cols <- colnames(select_if(input.data, date_check == TRUE))
cut.points <- as.Date(cut.points)
input.data <- filter(input.data, cols > cut.points[1] & cols < cut.points[2])
}
}
for ex. when i try to run this:
modifier(ex_data, cut.points = c("2012-01-01", "2013-01-01"))
On sample like this:
ex_data
Row.ID Order.ID Order.Date
1 32298 CA-2012-124891 2012-07-31
2 26341 IN-2013-77878 2013-02-05
3 25330 IN-2013-71249 2013-10-17
4 13524 ES-2013-1579342 2013-01-28
5 47221 SG-2013-4320 2013-11-05
6 22732 IN-2013-42360 2013-06-28
7 30570 IN-2011-81826 2011-11-07
8 31192 IN-2012-86369 2012-04-14
9 40155 CA-2014-135909 2014-10-14
10 40936 CA-2012-116638 2012-01-28
11 34577 CA-2011-102988 2011-04-05
12 28879 ID-2012-28402 2012-04-19
13 45794 SA-2011-1830 2011-12-27
14 4132 MX-2012-130015 2012-11-13
15 27704 IN-2013-73951 2013-06-06
16 13779 ES-2014-5099955 2014-07-31
17 36178 CA-2014-143567 2014-11-03
18 12069 ES-2014-1651774 2014-09-08
19 22096 IN-2014-11763 2014-01-31
20 49463 TZ-2014-8190 2014-12-05
the error is:
character string is not in a standard unambiguous format
I've added lubridateas a dependency so I could get access to %within% and is.Date. I've also changed the check condition, because I don't think your original one would work with NA, NA.
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
ex_data <- read_table(" Row.ID Order.ID Order.Date
1 32298 CA-2012-124891 2012-07-31
2 26341 IN-2013-77878 2013-02-05
3 25330 IN-2013-71249 2013-10-17
4 13524 ES-2013-1579342 2013-01-28
5 47221 SG-2013-4320 2013-11-05
6 22732 IN-2013-42360 2013-06-28
7 30570 IN-2011-81826 2011-11-07
8 31192 IN-2012-86369 2012-04-14
9 40155 CA-2014-135909 2014-10-14
10 40936 CA-2012-116638 2012-01-28
11 34577 CA-2011-102988 2011-04-05
12 28879 ID-2012-28402 2012-04-19
13 45794 SA-2011-1830 2011-12-27
14 4132 MX-2012-130015 2012-11-13
15 27704 IN-2013-73951 2013-06-06
16 13779 ES-2014-5099955 2014-07-31
17 36178 CA-2014-143567 2014-11-03
18 12069 ES-2014-1651774 2014-09-08
19 22096 IN-2014-11763 2014-01-31
20 49463 TZ-2014-8190 2014-12-05")
#> Warning: Missing column names filled in: 'X1' [1]
modifier <- function(input.data, cut.points = NULL) {
if (length(cut.points) == 2) {
date_col <- colnames(input.data)[sapply(input.data, is.Date)]
filtered.data <- input.data %>%
rename(Date = !! date_col) %>%
filter(Date %within% interval(cut.points[1], cut.points[2])) %>%
rename_with(~ date_col, Date)
return(filtered.data)
} else {
input.data
}
}
modifier(ex_data, cut.points = c("2012-01-01", "2013-01-01"))
#> # A tibble: 5 x 4
#> X1 Row.ID Order.ID Order.Date
#> <dbl> <dbl> <chr> <date>
#> 1 1 32298 CA-2012-124891 2012-07-31
#> 2 8 31192 IN-2012-86369 2012-04-14
#> 3 10 40936 CA-2012-116638 2012-01-28
#> 4 12 28879 ID-2012-28402 2012-04-19
#> 5 14 4132 MX-2012-130015 2012-11-13
I would like to achieve the following:
filter dataframe catalogs based on multiple columns in dataframe orders, for each row in dataframe orders and store the result in a list column in dataframe orders. (succeeded)
calculate the difference between a date in data frame orders and another date in the new listcolumn.
Table s_orders contains order data for different people (account keys). Table s_catalogs contains all catalogs that were sent to each account key
For each order, I want to know:
if and what catalogs were sent from the previous order (or the beginning) until the day before the focal order. More specifically, consumers received a (paper) catalog at s_catalogs$CATDATE. I want to know for each order what catalogs were received between the previous order (s_orders$PREVORDER) and the latest order. Because some consumers do not have a previous order I set the previous order date startdate to date("1999-12-31") which is the beginning of my dataset.
Then I want to do some calculations on the catalog data. (in this example: calculate the difference between date of a catalog and the order date)
For this, I have written a function getCatalogs, which takes the account key and two dates as input, and outputs a dataframe with the results from the other table. Would be much appreciated if someone has a better, more efficient solution? maybe with some sort of join?
I think my main problem is how to use mutate, pmap, pipes, pluck interchangeably for building complex queries on multiple tables.
My actual problem is outlined in sections Desired result and Problem.
# packages needed
library("dplyr")
library("lubridate")
library("purrr")
#library("tidyverse")
Example data
( i sampled some users from my data. s_ stands for 'sample')
# orders
s_orders <- structure(list(ACCNTKEY = c(2806, 2806, 2806, 3729, 3729, 3729,
3729, 4607, 4607, 4607, 4607, 4742, 11040, 11040, 11040, 11040,
11040, 17384), ORDDATE = structure(c(11325, 11703, 11709, 11330,
11375, 11384, 12153, 11332, 11445, 11589, 11713, 11333, 11353,
11429, 11662, 11868, 11960, 11382), class = "Date")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -18L))
# # A tibble: 18 x 2
# ACCNTKEY ORDDATE
# <dbl> <date>
# 1 2806 2001-01-03
# 2 2806 2002-01-16
# 3 2806 2002-01-22
# 4 3729 2001-01-08
# 5 3729 2001-02-22
# 6 3729 2001-03-03
# 7 3729 2003-04-11
# 8 4607 2001-01-10
# 9 4607 2001-05-03
# 10 4607 2001-09-24
# 11 4607 2002-01-26
# 12 4742 2001-01-11
# 13 11040 2001-01-31
# 14 11040 2001-04-17
# 15 11040 2001-12-06
# 16 11040 2002-06-30
# 17 11040 2002-09-30
# 18 17384 2001-03-01
# catalogs
s_catalogs <- structure(list(ACCNTKEY = c("2806", "2806", "4607", "2806", "4607",
"4607", "4607"), CATDATE = structure(c(11480, 11494, 11522, 11858,
11886, 12264, 12250), class = "Date"), CODE = c("2806/07/2001",
"2806/21/2001", "4607/19/2001", "2806/20/2002", "4607/18/2002",
"4607/31/2003", "4607/17/2003")), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
# # A tibble: 7 x 3
# ACCNTKEY CATDATE CODE
# <chr> <date> <chr>
# 1 2806 2001-06-07 2806/07/2001
# 2 2806 2001-06-21 2806/21/2001
# 3 4607 2001-07-19 4607/19/2001
# 4 2806 2002-06-20 2806/20/2002
# 5 4607 2002-07-18 4607/18/2002
# 6 4607 2003-07-31 4607/31/2003
# 7 4607 2003-07-17 4607/17/2003
calculate the lagged order date
# calculate previous order date for each order in s_orders
s_orders<-s_orders %>%
group_by(ACCNTKEY) %>%
arrange(ORDDATE) %>%
mutate(PREVORDER=as_date(lag(ORDDATE)))
So now we know the previous order (if any)
Function getCatalogs (improvement appreciated)
So the below function getCatalogs returns a dataframe with the catalogs that were received by that account key before the order (or actually in between the last orders/catalogs that were received between startdate and enddate).
# in case _startdate_ is missing then I set it to some starting value
getCatalogs<-function(key,startdate,enddate){
if(is.na(startdate)){
startdate<-as_date(date("1999-12-31"))
}
tmp <- s_catalogs[s_catalogs$ACCNTKEY==key &
s_catalogs$CATDATE<enddate &
s_catalogs$CATDATE>=startdate,]
if (NROW(tmp)>0){
return(tmp)
}else{return(NA)}
}
Use the function
let's get for each order all catalogs in a listcolumn
# For each row in s_orders search in dataframe s_catalogs all catalogs that were received for that account key before the order date but after the previous order.
s_orders <- s_orders %>% as_tibble() %>%
mutate(catalogs =
pmap(c(list(ACCNTKEY),list(PREVORDER),list(ORDDATE)),.f= function(x,y,z){getCatalogs(x,y,z)}))
This line for example gets the date of the latest catalog, which is what i need:
s_orders %>% pluck("catalogs") %>% pluck(13) %>% pluck("CATDATE") %>% max()
# [1] "2001-06-21"
Desired result:
Now I would like to retrieve the number of days between the above date and the date of the order (ORDDATE). The following code does it exactly but it is only correct in row 13.
# get amount of days since last catalog
s_orders3 <- s_orders %>%
mutate(diff = ORDDATE - s_orders %>%
pluck("catalogs") %>% pluck(13) %>% pluck("CATDATE") %>% max())
# # A tibble: 18 x 5
# ACCNTKEY ORDDATE PREVORDER catalogs diff
# <dbl> <date> <date> <list> <time>
# 1 2806 2001-01-03 NA <lgl [1]> -169 days
# 2 3729 2001-01-08 NA <lgl [1]> -164 days
# 3 4607 2001-01-10 NA <lgl [1]> -162 days
# 4 4742 2001-01-11 NA <lgl [1]> -161 days
# 5 11040 2001-01-31 NA <lgl [1]> -141 days
# 6 3729 2001-02-22 2001-01-08 <lgl [1]> -119 days
# 7 17384 2001-03-01 NA <lgl [1]> -112 days
# 8 3729 2001-03-03 2001-02-22 <lgl [1]> -110 days
# 9 11040 2001-04-17 2001-01-31 <lgl [1]> -65 days
# 10 4607 2001-05-03 2001-01-10 <lgl [1]> -49 days
# 11 4607 2001-09-24 2001-05-03 <tibble [1 × 3]> 95 days
# 12 11040 2001-12-06 2001-04-17 <lgl [1]> 168 days
# 13 2806 2002-01-16 2001-01-03 <tibble [2 × 3]> 209 days
# 14 2806 2002-01-22 2002-01-16 <lgl [1]> 215 days
# 15 4607 2002-01-26 2001-09-24 <lgl [1]> 219 days
# 16 11040 2002-06-30 2001-12-06 <lgl [1]> 374 days
# 17 11040 2002-09-30 2002-06-30 <lgl [1]> 466 days
# 18 3729 2003-04-11 2001-03-03 <lgl [1]> 659 days
Check manually:
date("2002-01-16")-date("2001-06-21")
# Time difference of 209 days
Problem
However, the code subtracts the same date from order date in every row. I want it to use the date that belongs to each particular row.
So the problem is how to replace the %>% pluck(13) %>% by some command that dows this trick to every row and put it in the diff column.
I am really searching for a solution that uses either purrr or dplyr or some other package that is just as efficient and clear.
Hoping that I have understood the question clearly, here is my attempt trying to solve the problem. I changed the getCatalogs function to return only max CATDATE in case if it is present.
library(dplyr)
library(purrr)
getCatalogs<-function(key,startdate,enddate){
if(is.na(startdate)) startdate<- as.Date("1999-12-31")
tmp <- s_catalogs$CATDATE[s_catalogs$ACCNTKEY==key &
s_catalogs$CATDATE<enddate &
s_catalogs$CATDATE>=startdate]
if (length(tmp) > 0) max(tmp) else NA
}
s1_orders<- s_orders %>%
group_by(ACCNTKEY) %>%
arrange(ORDDATE) %>%
mutate(PREVORDER=lag(ORDDATE))
and then use pmap like :
s1_orders %>%
mutate(catalogs = pmap_dbl(list(ACCNTKEY,PREVORDER,ORDDATE), getCatalogs),
catalogs = as.Date(catalogs, origin = "1970-01-01"),
diff = ORDDATE - catalogs)
# ACCNTKEY ORDDATE PREVORDER catalogs diff
# <dbl> <date> <date> <date> <drtn>
# 1 2806 2001-01-03 NA NA NA days
# 2 3729 2001-01-08 NA NA NA days
# 3 4607 2001-01-10 NA NA NA days
# 4 4742 2001-01-11 NA NA NA days
# 5 11040 2001-01-31 NA NA NA days
# 6 3729 2001-02-22 2001-01-08 NA NA days
# 7 17384 2001-03-01 NA NA NA days
# 8 3729 2001-03-03 2001-02-22 NA NA days
# 9 11040 2001-04-17 2001-01-31 NA NA days
#10 4607 2001-05-03 2001-01-10 NA NA days
#11 4607 2001-09-24 2001-05-03 2001-07-19 67 days
#12 11040 2001-12-06 2001-04-17 NA NA days
#13 2806 2002-01-16 2001-01-03 2001-06-21 209 days
#14 2806 2002-01-22 2002-01-16 NA NA days
#15 4607 2002-01-26 2001-09-24 NA NA days
#16 11040 2002-06-30 2001-12-06 NA NA days
#17 11040 2002-09-30 2002-06-30 NA NA days
#18 3729 2003-04-11 2001-03-03 NA NA days
Update
Without changing the current getCatalogs function, we can test the length of catalogs
s1_orders %>%
mutate(catalogs = pmap(list(ACCNTKEY,PREVORDER,ORDDATE), getCatalogs),
temp = map_dbl(catalogs, ~if (length(.x) > 1)
.x %>% pluck("CATDATE") %>% max else NA),
temp = as.Date(temp, origin = "1970-01-01"),
diff = ORDDATE - temp)
My data looks like this:
date rmean
1/2/2004 6
1/5/2004 30
1/6/2004 27
1/7/2004 20
1/8/2004 10
1/9/2004 22
1/12/2004 21
1/13/2004 18
1/14/2004 19
1/15/2004 7
1/16/2004 9
1/19/2004 11
1/20/2004 18
1/21/2004 26
1/26/2004 8
1/27/2004 16
1/28/2004 19
1/29/2004 4
1/30/2004 1
2/3/2004 11
2/4/2004 9
2/5/2004 26
2/6/2004 16
2/9/2004 25
2/10/2004 2
2/11/2004 6
2/12/2004 2
2/13/2004 25
2/16/2004 17
2/17/2004 21
2/18/2004 26
2/19/2004 6
2/20/2004 14
2/23/2004 4
2/24/2004 7
2/25/2004 19
2/26/2004 10
2/27/2004 23
I want to find the rmean of (20 days + 15th of each month).
Note: if there isn't a value for rmean of that date in my data (some days are skipped), i want it to find the rmean of closest day of the
something like this but ( 20 + 15th of each month) instead of 15 :
dt <- Dataframe[, list(day15=abs(mday(date)-15) == min(abs(mday(date)-15)),
date, rmean), by=list(year(date), month(date))]
dt[day15==TRUE]
Finale = dt[day15==TRUE , .SD[1,] ,by=list(month, year)]
The expected output for my example above:
date rmean
2/4/2004 9
Here's one way to do it with base R.
First, some dummy data:
d <- data.frame(date=as.Date('1/1/2004', '%d/%m/%Y') + sort(sample(364, 200)),
x=runif(200))
head(d)
# date x
# 1 2004-01-02 0.29818227
# 2 2004-01-03 0.12543617
# 3 2004-01-04 0.78145310
# 4 2004-01-05 0.30456904
# 5 2004-01-06 0.45228066
# 6 2004-01-07 0.07511554
Calculate arrival dates within the date range of the data:
arrival <-
seq(as.Date(sprintf('15/%s', format(min(d$date), '%m/%Y')), '%d/%m/%Y'),
as.Date(sprintf('15/%s', format(max(d$date), '%m/%Y')), '%d/%m/%Y'),
by='month') + 20
arrival
# [1] "2004-02-04" "2004-03-06" "2004-04-04" "2004-05-05" "2004-06-04" "2004-07-05"
# [7] "2004-08-04" "2004-09-04" "2004-10-05" "2004-11-04" "2004-12-05" "2005-01-04"
Find the closest date to each of the arrival dates (taking that with max x value if there are two closest dates), and return a data.frame with the "arrival" dates, the closest dates to each of these arrival dates, and the corresponding values of x.
cbind(arrival, do.call(rbind, lapply(arrival, function(x) {
closest <- which(abs(d$date - x) == min(abs(d$date - x)))
d[closest[which.max(d$x[closest])], ]
})))
# arrival date x
# 25 2004-02-04 2004-02-03 0.78836413
# 45 2004-03-06 2004-03-06 0.61214949
# 63 2004-04-04 2004-04-04 0.49171847
# 79 2004-05-05 2004-05-05 0.02989788
# 93 2004-06-04 2004-06-04 0.25923715
# 109 2004-07-05 2004-07-05 0.90330331
# 120 2004-08-04 2004-08-04 0.48133237
# 139 2004-09-04 2004-09-03 0.12280267
# 151 2004-10-05 2004-10-03 0.46888891
# 169 2004-11-04 2004-11-04 0.40397949
# 186 2004-12-05 2004-12-04 0.18685615
# 200 2005-01-04 2004-12-30 0.97462347