Filter grouped data - R

I have a data.frame that looks like this
DATE MEAN SUM MAX MIN SAISON JAHR
1 1995-09-01 00:00:00 2.370833 56.9 7.4 0 S 1995
2 1995-09-01 01:00:00 2.225000 53.4 7.4 0 S 1995
3 1995-09-01 02:00:00 2.091667 50.2 7.4 0 S 1995
4 1995-09-01 03:00:00 1.929167 46.3 7.4 0 S 1995
5 1995-09-01 04:00:00 1.745833 41.9 7.4 0 S 1995
6 1995-09-01 05:00:00 1.558333 37.4 7.4 0 S 1995
....
With the dplyr package I am able to extract the highest SUM for every SAISON and JAHR:
gJahrSAISON_24 <- group_by(.data = dataframe, JAHR, SAISON)
summarise(gJahrSAISON_24, hoechsterNiederschlag = max(SUM))
Do you have any idea how to extract the ten(!) highest sums for every JAHR and SAISON?

You can use slice with arrange
library(dplyr)
df1 %>%
group_by(JAHR, SAISON) %>%
arrange(desc(SUM)) %>%
slice(1:10)
Or filter with min_rank/dense_rank (ranking desc(SUM) so the ten highest rows are kept)
df1 %>%
group_by(JAHR, SAISON) %>%
filter(dense_rank(desc(SUM)) <= 10)
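With dplyr 1.0.0 or later, slice_max expresses the same idea directly (a sketch, assuming the same data; with_ties = FALSE guarantees at most ten rows per group):
df1 %>%
  group_by(JAHR, SAISON) %>%
  slice_max(SUM, n = 10, with_ties = FALSE)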
Similar options using data.table are
library(data.table) # v1.9.5+
setDT(df1)[order(-SUM), .SD[1:10], by = .(JAHR, SAISON)]
Or
setDT(df1)[, .SD[frank(-SUM, ties.method = 'first') <= 10], by = .(JAHR, SAISON)]
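One caveat: .SD[1:10] pads any group that has fewer than ten rows with NA rows; head(.SD, 10) returns only the rows that exist (a sketch):
setDT(df1)[order(-SUM), head(.SD, 10), by = .(JAHR, SAISON)]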
Or using sqldf
library(sqldf)
sqldf('select * from df1 i
where rowid in
(select rowid from df1
where JAHR = i.JAHR and SAISON=i.SAISON
order by SUM desc
limit 10)
order by i.JAHR, i.SAISON, i.SUM desc')
Or with base R
df1[with(df1, ave(SUM, SAISON, JAHR, FUN=function(x)
rank(-x, ties.method='first'))<=10),]

How to fill dates between two dates [duplicate]

Here is what my current dataframe looks like:
df <- data.frame(name = c("A", "A", "A", "B", "B"),
                 start_date = c("2020-01-23", "2019-10-15", "2019-07-28", "2020-03-15", "2019-04-23"),
                 end_date = c("2020-05-15", "2020-01-27", "2019-10-17", "2020-07-25", "2020-02-13"),
                 value = c(8.1, 3.3, 9.1, 9.4, 15.3))
name start_date   end_date value
   A 2020-01-23 2020-05-15   8.1
   A 2019-10-15 2020-01-27   3.3
   A 2019-07-28 2019-10-17   9.1
   B 2020-03-15 2020-07-25   9.4
   B 2019-04-23 2020-02-13  15.3
The dates are in POSIXct, are not necessarily consecutive, and can overlap.
I would like my output dataframe to look something like this:
name date value
A 2020-01-23 8.1
A 2020-01-24 8.1
A ... 8.1
A 2020-05-14 8.1
A 2020-05-15 8.1
A 2019-10-15 3.3
A 2019-10-16 3.3
A ... 3.3
A 2020-01-26 3.3
A 2020-01-27 3.3
A 2019-07-28 9.1
A 2019-07-29 9.1
A ... 9.1
A 2019-10-16 9.1
A 2019-10-17 9.1
B 2020-03-15 9.4
B 2020-03-16 9.4
B ... 9.4
B 2020-07-24 9.4
B 2020-07-25 9.4
B 2019-04-23 15.3
B 2019-04-24 15.3
B ... 15.3
B 2020-02-12 15.3
B 2020-02-13 15.3
Here is what I have been trying:
library(data.table)
setDT(df)[, .(date = seq(as.Date(start_date), as.Date(end_date), by = "day")), by = end_date]
But I have been getting the following error:
Error in seq.Date(as.Date(start_date), as.Date(end_date), by = "day") :
'from' must be of length 1
How should I do this? I am open to using other packages rather than data.table if they work better.
Here, we may need to use by as a sequence of rows, so that seq receives length-1 'from' and 'to' values for each row
library(data.table)
setDT(df)[, .(date = seq(as.Date(start_date), as.Date(end_date),
by = 'day')), .(rn = seq_len(nrow(df)), name, value)][, rn := NULL][]
Or create a list column by looping over corresponding elements of 'start_date', 'end_date' to create a sequence of dates in Map and then unnest the list
library(tidyr)
library(magrittr)
setDT(df)[, .(name, date = Map(seq, MoreArgs = list(by = '1 day'),
as.Date(start_date), as.Date(end_date)), value)] %>%
unnest(date)
# A tibble: 731 x 3
# name date value
# <chr> <date> <dbl>
# 1 A 2020-01-23 8.1
# 2 A 2020-01-24 8.1
# 3 A 2020-01-25 8.1
# 4 A 2020-01-26 8.1
# 5 A 2020-01-27 8.1
# 6 A 2020-01-28 8.1
# 7 A 2020-01-29 8.1
# 8 A 2020-01-30 8.1
# 9 A 2020-01-31 8.1
#10 A 2020-02-01 8.1
# … with 721 more rows
Another approach using purrr
df <- data.frame(name = c("A", "A", "A", "B", "B"),
start_date = c("2020-01-23", "2019-10-15", "2019-07-28", "2020-03-15", "2019-04-23"),
end_date = c("2020-05-15", "2020-01-27", "2019-10-17", "2020-07-25", "2020-02-13"),
value = c(8.1, 3.3, 9.1, 9.4, 15.3))
library(dplyr)
library(purrr)
# function takes in name, start, end, and value, and generates a filled df as wanted
generate_fill <- function(name, start, end, value) {
tibble(name = name,
date = seq(as.Date(start), as.Date(end), by = "1 day"),
value = value)
}
# Map the function over the original df and combine the results
bind_rows(
pmap(list(df[["name"]], df[["start_date"]], df[["end_date"]], df[["value"]]),
generate_fill))
Output
# A tibble: 731 x 3
name date value
<chr> <date> <dbl>
1 A 2020-01-23 8.1
2 A 2020-01-24 8.1
3 A 2020-01-25 8.1
4 A 2020-01-26 8.1
5 A 2020-01-27 8.1
6 A 2020-01-28 8.1
7 A 2020-01-29 8.1
8 A 2020-01-30 8.1
9 A 2020-01-31 8.1
10 A 2020-02-01 8.1
# … with 721 more rows
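Another tidyverse route, a sketch using rowwise to build the list column and tidyr::unnest to expand it (assuming the same df as above):
library(dplyr)
library(tidyr)
df %>%
  rowwise() %>%
  mutate(date = list(seq(as.Date(start_date), as.Date(end_date), by = "day"))) %>%
  ungroup() %>%
  unnest(date) %>%
  select(name, date, value)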

Group by weekly data and summarize by month in R with dplyr

I have a dataset of weekly mortgage rate data.
The data looks very simple:
library(tibble)
library(lubridate)
df <- tibble(
  Date = as_date(c("2/7/2008", "2/14/2008", "2/21/2008", "2/28/2008", "3/6/2008"), format = "%m/%d/%Y"),
  Rate = c(5.67, 5.72, 6.04, 6.24, 6.03)
)
I am trying to group it and summarize by month.
This blogpost and this answer are not what I want, because they just add the month column.
They give me the output:
month Date summary_variable
2008-02-01 2008-02-07 5.67
2008-02-01 2008-02-14 5.72
2008-02-01 2008-02-21 6.04
2008-02-01 2008-02-28 6.24
My desired output (ideally the last day of the month):
Month Average rate
2/28/2008 6
3/31/2008 6.1
4/30/2008 5.9
In the output above I put random numbers, not real calculations.
We can extract the month as a column, set to the last day of each month, and take a grouped mean
library(dplyr)
library(lubridate)
library(zoo)
df %>%
group_by(Month = as.Date(as.yearmon(Date), frac = 1)) %>%
summarise(Average_rate = mean(Rate))
-output (computed on the full weekly series, of which the question shows only the first five rows)
# A tibble: 151 x 2
# Month Average_rate
# <date> <dbl>
# 1 2008-02-29 5.92
# 2 2008-03-31 5.97
# 3 2008-04-30 5.92
# 4 2008-05-31 6.04
# 5 2008-06-30 6.32
# 6 2008-07-31 6.43
# 7 2008-08-31 6.48
# 8 2008-09-30 6.04
# 9 2008-10-31 6.2
#10 2008-11-30 6.09
# … with 141 more rows
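If you would rather avoid zoo, here is a lubridate-only sketch of the same grouping; change_on_boundary = TRUE makes ceiling_date safe even for dates that already fall on the first of a month:
library(dplyr)
library(lubridate)
df %>%
  group_by(Month = ceiling_date(Date, "month", change_on_boundary = TRUE) - days(1)) %>%
  summarise(Average_rate = mean(Rate))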

Is there a function in R that will sum values based on Date of Year?

I have a data table (Precip15) consisting of columns of precipitation, day of year (DOY), and Date_Time in POSIXct format. I need to be able to see the total precipitation (Rain_cm) for every day recorded. Any suggestions?
An example of the data table format looks like this:
DOY Rain Rain_cm Date_Time
179 6 0.6 2019-06-28 15:00:00
179 0 NA 2019-06-28 15:15:00
179 2 0.2 2019-06-28 16:45:00
180 0 NA 2019-06-29 10:00:00
180 10.2 1.2 2019-06-29 10:15:00
180 2 0.2 2019-06-29 13:00:00
I need it to look like this:
DOY Rain_cm
179 0.8
180 1.4
or possibly:
Date Rain_cm
2019-06-28 0.8
2019-06-29 1.4
Thanks in advance for any help!
Here are some base R solutions using the data frame DF defined reproducibly in the Note at the end. Solutions based on dplyr, data.table or zoo packages would be possible as well.
1) aggregate: aggregate on DOY or on Date (defined in the transform statement below), depending on what you want. Note that aggregate automatically removes rows with NAs.
aggregate(Rain_cm ~ DOY, DF, sum)
## DOY Rain_cm
## 1 179 0.8
## 2 180 1.4
DF2 <- transform(DF, Date = as.Date(Date_Time))
aggregate(Rain_cm ~ Date, DF2, sum)
## Date Rain_cm
## 1 2019-06-28 0.8
## 2 2019-06-29 1.4
2) rowsum: Another base R solution is rowsum, which returns a one-column matrix whose row names are the values of the grouping variable. DF2 is from (1).
with(na.omit(DF), rowsum(Rain_cm, DOY))
## [,1]
## 179 0.8
## 180 1.4
with(na.omit(DF2), rowsum(Rain_cm, Date))
## [,1]
## 2019-06-28 0.8
## 2019-06-29 1.4
3) tapply: Another base R approach is tapply. This produces a named numeric vector. DF2 is from (1).
with(DF, tapply(Rain_cm, DOY, sum, na.rm = TRUE))
## 179 180
## 0.8 1.4
with(DF2, tapply(Rain_cm, Date, sum, na.rm = TRUE))
## 2019-06-28 2019-06-29
## 0.8 1.4
4) xtabs: xtabs can be used to form an xtabs table object. DF2 is from (1).
xtabs(Rain_cm ~ DOY, DF)
## DOY
## 179 180
## 0.8 1.4
xtabs(Rain_cm ~ Date, DF2)
## Date
## 2019-06-28 2019-06-29
## 0.8 1.4
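If you need an ordinary data frame rather than an xtabs table, as.data.frame converts it; the responseName argument names the value column:
xt <- xtabs(Rain_cm ~ DOY, DF)
as.data.frame(xt, responseName = "Rain_cm")
##   DOY Rain_cm
## 1 179     0.8
## 2 180     1.4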
Note
The data in reproducible form is assumed to be:
Lines <- "DOY Rain Rain_cm Date_Time
179 6 0.6 2019-06-28 15:00:00
179 0 NA 2019-06-28 15:15:00
179 2 0.2 2019-06-28 16:45:00
180 0 NA 2019-06-29 10:00:00
180 10.2 1.2 2019-06-29 10:15:00
180 2 0.2 2019-06-29 13:00:00"
L <- readLines(textConnection(Lines))
DF <- read.csv(text = gsub(" +", ",", Lines))
Another option is the tidyverse, entering the data with tribble:
library(dplyr)
library(tibble)
library(lubridate)
df <- tribble(
  ~DOY, ~Rain, ~Rain_cm, ~Date_Time,
  179,  6,     0.6,      "2019-06-28 15:00:00",
  179,  0,     NA,       "2019-06-28 15:15:00",
  179,  2,     0.2,      "2019-06-28 16:45:00",
  180,  0,     NA,       "2019-06-29 10:00:00",
  180,  10.2,  1.2,      "2019-06-29 10:15:00",
  180,  2,     0.2,      "2019-06-29 13:00:00"
)
df %>%
mutate(Date_Time = ymd_hms(Date_Time)) %>%
mutate(Date = as.Date(Date_Time)) %>%
group_by(Date) %>%
summarise(perDate = sum(Rain_cm, na.rm = TRUE))
Date perDate
<date> <dbl>
1 2019-06-28 0.8
2 2019-06-29 1.4
You can use the aggregate and cut functions to calculate your total daily precip values. The following code will provide you with the desired results:
precipTotals <- aggregate(Rain_cm ~ cut(Date_Time, breaks = "day"), data = df,
                          FUN = sum, na.rm = TRUE)
Make sure your precip columns are numeric and your Date_Time is in as.POSIXct() format, and this will work for you.
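For example, if both came in as character, the conversions might look like this (a sketch; adjust the format string to match your data):
df$Rain_cm   <- as.numeric(df$Rain_cm)
df$Date_Time <- as.POSIXct(df$Date_Time, format = "%Y-%m-%d %H:%M:%S")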

Calculate average of month and replace values of other column

I have a dataframe as given below:
vdate=c("12-04-2015","13-04-2015","14-04-2015","15-04-2015","12-05-2015","13-05-2015","14-05-2015"
,"15-05-2015","12-06-2015","13-06-2015","14-06-2015","15-06-2015")
month=c(4,4,4,4,5,5,5,5,6,6,6,6)
col1=c(12,12.4,14.3,3,5.3,1.8,7.6,4.5,7.6,10.7,12,15.7)
df=data.frame(vdate,month,col1)
Below is a column whose values come from some calculation:
pvar=c(8.4,2.4,12,14.4,2.3,3.5,7.8,5,16,5.4,18,18.4)
Now I want to replace each pvar value that is less than the average value for its month.
For example, for month 4,
the average value of pvar is 9.3 ((8.4 + 2.4 + 12 + 14.4) / 4).
Then replace all the values in pvar that are less than the month-4 average, i.e. 8.4 and 2.4.
The pvar values would then be 9.3, 9.3, 12, 14.4.
I need to do this for all the values in pvar.
A base R solution would be to use ave. Note that we first need to convert the date column to an actual date in order to extract the month (strsplit or regex can also do it, but I prefer to have it set as a proper date), i.e.
df$vdate <- as.POSIXct(df$vdate, format = '%d-%m-%Y')
with(df, ave(pvar, format(vdate, '%m'), FUN = function(i) replace(i, i < mean(i), mean(i))))
#[1] 9.30 9.30 12.00 14.40 4.65 4.65 7.80 5.00 16.00 14.45 18.00 18.40
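As mentioned, the month can also be pulled straight out of the original character vdate without any date conversion, e.g. with substr (a sketch that must run before the as.POSIXct conversion above, since positions 4-5 hold the month in "12-04-2015"):
with(df, ave(pvar, substr(vdate, 4, 5), FUN = function(i) replace(i, i < mean(i), mean(i))))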
As per your edit, I will use dplyr to tackle it as it might be more readable. There are actually two ways I came up with.
First: Create an extra grouping variable that will put all the months you need to alter the values in the same group and replace from there, i.e.
library(dplyr)
cbind(df, pvar) %>%
group_by(grp = cumsum(!month %in% c(4, 5))+1, month) %>%
mutate(pvar = replace(pvar, pvar < mean(pvar), mean(pvar))) %>%
ungroup() %>%
select(-grp)
Second: Filter the months you need, do the calculations. Then filter the months you don't need, create again the pvar but without changing anything (necessary for binding the rows) and bind the rows, i.e.
bind_rows(
cbind(df, pvar) %>%
filter(month %in% c(4, 5)) %>%
group_by(month) %>%
mutate(pvar = replace(pvar, pvar < mean(pvar), mean(pvar))),
cbind(df, pvar) %>%
filter(!month %in% c(4, 5))
)
Both the above give,
   vdate      month  col1  pvar
   <fct>      <dbl> <dbl> <dbl>
 1 12-04-2015     4 12.0   9.30
 2 13-04-2015     4 12.4   9.30
 3 14-04-2015     4 14.3  12.0
 4 15-04-2015     4  3.00 14.4
 5 12-05-2015     5  5.30  4.65
 6 13-05-2015     5  1.80  4.65
 7 14-05-2015     5  7.60  7.80
 8 15-05-2015     5  4.50  5.00
 9 12-06-2015     6  7.60 16.0
10 13-06-2015     6 10.7   5.40
11 14-06-2015     6 12.0  18.0
12 15-06-2015     6 15.7  18.4
A dplyr-based solution could be:
# an additional condition checks that month != 6
cbind(df, pvar) %>%
group_by(month) %>%
mutate(pvar = ifelse(pvar < mean(pvar) & month != 6, mean(pvar), pvar)) %>%
as.data.frame()
# vdate month col1 pvar
# 1 12-04-2015 4 12.0 9.30
# 2 13-04-2015 4 12.4 9.30
# 3 14-04-2015 4 14.3 12.00
# 4 15-04-2015 4 3.0 14.40
# 5 12-05-2015 5 5.3 4.65
# 6 13-05-2015 5 1.8 4.65
# 7 14-05-2015 5 7.6 7.80
# 8 15-05-2015 5 4.5 5.00
# 9 12-06-2015 6 7.6 16.00
# 10 13-06-2015 6 10.7 5.40
# 11 14-06-2015 6 12.0 18.00
# 12 15-06-2015 6 15.7 18.40
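For completeness, the same month != 6 restriction can be written in base R by updating only the targeted rows, a sketch building on the ave idea from the first answer:
df$pvar <- pvar
idx <- df$month != 6
df$pvar[idx] <- ave(df$pvar[idx], df$month[idx],
                    FUN = function(i) replace(i, i < mean(i), mean(i)))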
Data
vdate=c("12-04-2015","13-04-2015","14-04-2015","15-04-2015","12-05-2015",
"13-05-2015","14-05-2015","15-05-2015","12-06-2015","13-06-2015",
"14-06-2015","15-06-2015")
month=c(4,4,4,4,5,5,5,5,6,6,6,6)
col1=c(12,12.4,14.3,3,5.3,1.8,7.6,4.5,7.6,10.7,12,15.7)
df=data.frame(vdate,month,col1)
pvar=c(8.4,2.4,12,14.4,2.3,3.5,7.8,5,16,5.4,18,18.4)

Rolling sums for groups with uneven time gaps

Here's the tweak to my previously posted question. Here's my data:
set.seed(3737)
DF2 = data.frame(user_id = c(rep(27, 7), rep(11, 7)),
date = as.Date(rep(c('2016-01-01', '2016-01-03', '2016-01-05', '2016-01-07', '2016-01-10', '2016-01-14', '2016-01-16'), 2)),
value = round(rnorm(14, 15, 5), 1))
user_id date value
27 2016-01-01 15.0
27 2016-01-03 22.4
27 2016-01-05 13.3
27 2016-01-07 21.9
27 2016-01-10 20.6
27 2016-01-14 18.6
27 2016-01-16 16.4
11 2016-01-01 6.8
11 2016-01-03 21.3
11 2016-01-05 19.8
11 2016-01-07 22.0
11 2016-01-10 19.4
11 2016-01-14 17.5
11 2016-01-16 19.3
This time, I'd like to calculate the cumulative sum of value for each user_id over a specified time period, e.g. the last 7 or 14 days. The desired solution would look like this:
user_id date value v_minus7 v_minus14
27 2016-01-01 15.0 15.0 15.0
27 2016-01-03 22.4 37.4 37.4
27 2016-01-05 13.3 50.7 50.7
27 2016-01-07 21.9 72.6 72.6
27 2016-01-10 20.6 78.2 93.2
27 2016-01-14 18.6 61.1 111.8
27 2016-01-16 16.4 55.6 113.2
11 2016-01-01 6.8 6.8 6.8
11 2016-01-03 21.3 28.1 28.1
11 2016-01-05 19.8 47.9 47.9
11 2016-01-07 22.0 69.9 69.9
11 2016-01-10 19.4 82.5 89.3
11 2016-01-14 17.5 58.9 106.8
11 2016-01-16 19.3 56.2 119.3
Ideally, I'd like to use dplyr for this, but other packages would be fine.
Logic: first group by user_id and date. For each subset of the data, we check which dates lie between the current date and 7/14 days back using between(), which returns a logical vector.
Based on this logical vector, we sum the value column
library(data.table)
setDT(DF2)[, `:=`(v_minus7 = sum(DF2$value[DF2$user_id == user_id][between(DF2$date[DF2$user_id == user_id], date-7, date, incbounds = TRUE)]),
v_minus14 = sum(DF2$value[DF2$user_id == user_id][between(DF2$date[DF2$user_id == user_id], date-14, date, incbounds = TRUE)])),
by = c("user_id", "date")][]
# user_id date value v_minus7 v_minus14
#1: 27 2016-01-01 15.0 15.0 15.0
#2: 27 2016-01-03 22.4 37.4 37.4
#3: 27 2016-01-05 13.3 50.7 50.7
#4: 27 2016-01-07 21.9 72.6 72.6
#5: 27 2016-01-10 20.6 78.2 93.2
#6: 27 2016-01-14 18.6 61.1 111.8
#7: 27 2016-01-16 16.4 55.6 113.2
#8: 11 2016-01-01 6.8 6.8 6.8
#9: 11 2016-01-03 21.3 28.1 28.1
#10: 11 2016-01-05 19.8 47.9 47.9
#11: 11 2016-01-07 22.0 69.9 69.9
#12: 11 2016-01-10 19.4 82.5 89.3
#13: 11 2016-01-14 17.5 58.9 106.8
#14: 11 2016-01-16 19.3 56.2 119.3
# adapted from alexis_laz's answer below.
ff = function(date, value, minus){
cs = cumsum(value)
i = findInterval(date - minus, date, rightmost.closed = TRUE)
w = which(as.logical(i))
i[w] = cs[i[w]]
cs - i
}
setDT(DF2)
DF2[, `:=`( v_minus7 = ff(date, value, 7),
v_minus14 = ff(date, value, 14)), by = c("user_id")]
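A data.table non-equi self-join is another idiomatic way to express the same inclusive window (a sketch, assuming data.table 1.9.8+ for non-equi joins):
setDT(DF2)
DF2[, v_minus7 := DF2[DF2[, .(user_id, d1 = date - 7, d2 = date)],
                      on = .(user_id, date >= d1, date <= d2),
                      sum(value), by = .EACHI]$V1]
# the same pattern with date - 14 gives v_minus14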
You can use rollapply from zoo once you fill in the missing dates:
library(dplyr)
library(zoo)
set.seed(3737)
DF2 = data.frame(user_id = c(rep(27, 7), rep(11, 7)),
date = as.Date(rep(c('2016-01-01', '2016-01-03', '2016-01-05', '2016-01-07', '2016-01-10', '2016-01-14', '2016-01-16'), 2)),
value = round(rnorm(14, 15, 5), 1))
all_combinations <- expand.grid(user_id=unique(DF2$user_id),
date=seq(min(DF2$date), max(DF2$date), by="day"))
res <- DF2 %>%
merge(all_combinations, by=c('user_id','date'), all=TRUE) %>%
group_by(user_id) %>%
arrange(date) %>%
mutate(v_minus7=rollapply(value, width=8, FUN=function(x) sum(x, na.rm=TRUE), partial=TRUE, align='right'),
v_minus14=rollapply(value, width=15, FUN=function(x) sum(x, na.rm=TRUE), partial=TRUE, align='right')) %>%
filter(!is.na(value))
Here is another idea with findInterval to minimize comparisons and operations. First define a function to accommodate the basic part, ignoring the grouping. The following function computes the cumulative sum, and subtracts from the cumulative sum at each position the cumulative sum at its respective past date:
ff = function(date, value, minus)
{
cs = cumsum(value)
i = findInterval(date - minus, date, left.open = TRUE)
w = which(as.logical(i))
i[w] = cs[i[w]]
cs - i
}
And apply it by group:
do.call(rbind,
lapply(split(DF2, DF2$user_id),
function(x) data.frame(x,
minus7 = ff(x$date, x$value, 7),
minus14 = ff(x$date, x$value, 14))))
# user_id date value minus7 minus14
#11.8 11 2016-01-01 6.8 6.8 6.8
#11.9 11 2016-01-03 21.3 28.1 28.1
#11.10 11 2016-01-05 19.8 47.9 47.9
#11.11 11 2016-01-07 22.0 69.9 69.9
#11.12 11 2016-01-10 19.4 82.5 89.3
#11.13 11 2016-01-14 17.5 58.9 106.8
#11.14 11 2016-01-16 19.3 56.2 119.3
#27.1 27 2016-01-01 15.0 15.0 15.0
#27.2 27 2016-01-03 22.4 37.4 37.4
#27.3 27 2016-01-05 13.3 50.7 50.7
#27.4 27 2016-01-07 21.9 72.6 72.6
#27.5 27 2016-01-10 20.6 78.2 93.2
#27.6 27 2016-01-14 18.6 61.1 111.8
#27.7 27 2016-01-16 16.4 55.6 113.2
The above apply-by-group operation can, of course, be replaced by any method preferable.
Here are some approaches using zoo.
1) Define a function sum_last that, given a zoo object, takes the sum of the values whose times are within k days of the last day in the series, and define a roll function which applies it to an entire series. Then use ave to apply roll to each user_id, once for k = 7 and once for k = 14.
Note that this makes use of the coredata argument to rollapply, which was introduced in a recent version of zoo, so be sure you don't have an earlier version.
library(zoo)
# compute sum of values within k time units of last time point
sum_last <- function(z, k) {
tt <- time(z)
sum(z[tt > tail(tt, 1) - k])
}
# given indexes ix run rollapplyr on read.zoo(DF2[ix, -1])
roll <- function(ix, k) {
rollapplyr(read.zoo(DF2[ix, -1]), k, sum_last, coredata = FALSE, partial = TRUE, k = k)
}
nr <- nrow(DF2)
transform(DF2,
v_minus7 = ave(1:nr, user_id, FUN = function(x) roll(x, 7)),
v_minus14 = ave(1:nr, user_id, FUN = function(x) roll(x, 14)))
2) An alternative would be to replace roll with the version shown below. This converts DF2[ix, -1] to "zoo" and merges it with a zero-width grid so the gaps are filled in. Then rollapply is applied to that, and we use window to subset it back to the original times.
roll <- function(ix, k) {
z <- read.zoo(DF2[ix, -1])
g <- zoo(, seq(start(z), end(z), "day"))
m <- merge(z, g, fill = 0)
r <- rollapplyr(m, k, sum, partial = TRUE)
window(r, time(z))
}
Try the runner package if you want to calculate over time/date windows. See the 'Windows depending on date' section of its GitHub documentation.
library(runner)
DF2 %>%
group_by(user_id) %>%
mutate(
v_minus7 = sum_run(value, 7, idx = date),
v_minus14 = sum_run(value, 14, idx = date)
)
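The slider package offers the same date-indexed windows (a sketch, assuming slider 0.2.0+; before = 7 matches the inclusive [date - 7, date] window used above):
library(dplyr)
library(slider)
DF2 %>%
  group_by(user_id) %>%
  mutate(
    v_minus7  = slide_index_sum(value, date, before = 7),
    v_minus14 = slide_index_sum(value, date, before = 14)
  )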
A benchmark:
library(data.table)
library(dplyr)
library(zoo)
library(tbrf)
set.seed(3737)
DF2 = data.frame(user_id = c(rep(27, 7), rep(11, 7)),
date = as.Date(rep(c('2016-01-01', '2016-01-03', '2016-01-05', '2016-01-07', '2016-01-10', '2016-01-14', '2016-01-16'), 2)),
value = round(rnorm(14, 15, 5), 1))
# example 1
data_table <- function(DF2) {
setDT(DF2)[, `:=`(v_minus7 = sum(DF2$value[DF2$user_id == user_id][data.table::between(DF2$date[DF2$user_id == user_id], date-7, date, incbounds = TRUE)]),
v_minus14 = sum(DF2$value[DF2$user_id == user_id][data.table::between(DF2$date[DF2$user_id == user_id], date-14, date, incbounds = TRUE)])),
by = c("user_id", "date")][]
}
# example 2
dplyr_grid <- function(DF2) {
all_combinations <- expand.grid(user_id=unique(DF2$user_id),
date=seq(min(DF2$date), max(DF2$date), by="day"))
DF2 %>%
merge(all_combinations, by=c('user_id','date'), all=TRUE) %>%
group_by(user_id) %>%
arrange(date) %>%
mutate(v_minus7=rollapply(value, width=8, FUN=function(x) sum(x, na.rm=TRUE), partial=TRUE, align='right'),
v_minus14=rollapply(value, width=15, FUN=function(x) sum(x, na.rm=TRUE), partial=TRUE, align='right')) %>%
filter(!is.na(value))
}
# example 3
dplyr_tbrf <- function(DF2) {
DF2 %>%
group_by(user_id) %>%
tbrf::tbr_sum(value, date, unit = "days", n = 7) %>%
arrange(user_id, date) %>%
rename(v_minus7 = sum) %>%
tbrf::tbr_sum(value, date, unit = "days", n = 14) %>%
rename(v_minus14 = sum)
}
# example 4
runner <- function(DF2) {
DF2 %>%
group_by(user_id) %>%
mutate(
v_minus7 = sum_run(value, 7, idx = date),
v_minus14 = sum_run(value, 14, idx = date)
)
}
microbenchmark::microbenchmark(
runner = runner(DF2),
data.table = data_table(DF2),
dplyr = dplyr_grid(DF2),
dplyr_tbrf = dplyr_tbrf(DF2),
times = 100L
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# runner 1.478331 1.797512 2.350416 2.083680 2.559875 9.181675 100
# data.table 5.432618 5.970619 7.107540 6.424862 7.563405 13.674661 100
# dplyr 63.841710 73.652023 86.228112 79.861760 92.304231 256.841078 100
# dplyr_tbrf 60.582381 72.511075 90.175891 80.435700 92.865997 307.454643 100
Here is a new option using dplyr and tbrf
library(tbrf)
library(dplyr)
set.seed(3737)
DF2 = data.frame(user_id = c(rep(27, 7), rep(11, 7)),
date = as.Date(rep(c('2016-01-01', '2016-01-03', '2016-01-05', '2016-01-07', '2016-01-10', '2016-01-14', '2016-01-16'), 2)),
value = round(rnorm(14, 15, 5), 1))
DF2 %>%
group_by(user_id) %>%
tbrf::tbr_sum(value, date, unit = "days", n = 7) %>%
arrange(user_id, date) %>%
rename(v_minus7 = sum) %>%
tbrf::tbr_sum(value, date, unit = "days", n = 14) %>%
rename(v_minus14 = sum)
Creates a tibble:
# A tibble: 14 x 5
user_id date value v_minus7 v_minus14
<dbl> <date> <dbl> <dbl> <dbl>
1 11 2016-01-01 6.8 6.8 21.8
2 27 2016-01-01 15 15 21.8
3 11 2016-01-03 21.3 28.1 65.5
4 27 2016-01-03 22.4 37.4 65.5
5 11 2016-01-05 19.8 47.9 98.6
6 27 2016-01-05 13.3 50.7 98.6
7 11 2016-01-07 22 69.9 142.
8 27 2016-01-07 21.9 72.6 142.
9 11 2016-01-10 19.4 82.5 182.
10 27 2016-01-10 20.6 78.2 182.
11 11 2016-01-14 17.5 58.9 219.
12 27 2016-01-14 18.6 61.1 219.
13 11 2016-01-16 19.3 56.2 232.
14 27 2016-01-16 16.4 55.6 232.
I suspect this isn't the fastest solution with larger datasets, but it works well in dplyr chains.
