Calculating conditional cumulative time - r

Following the pointers from this question.
I'd like to calculate the cumulative time for all the Cats, by considering their respective last toggle status.
EDIT:
I'd also want to check whether the FIRST Toggle status of a Cat is Off; if so, for that specific Cat, the time from midnight (00:00:00) till this first Off time should be added to its total conditional cumulative on-time.
Sample data:
Time Cat Toggle
1 05:12:09 36 On
2 05:12:12 26R Off # First Toggle of this Cat happens to be Off, Condition met
3 05:12:15 26R On
4 05:12:16 26R Off
5 05:12:18 99 Off # Condition met
6 05:12:18 99 On
7 05:12:24 36 Off
8 05:12:26 36 On
9 05:12:29 80 Off # Condition met
10 05:12:30 99 Off
11 05:12:31 95 Off # Condition met
12 05:12:32 36 Off
Desired sample output:
Cat Time(Secs)
1 36 21
2 26R 18733 # (=1+18732), 18732 secs to be added = total Sec from midnight till 05:12:12
3 99 18750 # (=12+18738), 18738 secs to be added = total Sec from midnight till 05:12:18
4 .. ..
Any sort of help is appreciated.

Using base R:
df$Time <- as.POSIXct(df$Time, format = "%H:%M:%S")
# per Cat, sum each time gap that ends in an "Off" row (i.e. time spent On)
stack(by(df, df$Cat, function(x) sum(c(0, diff(x$Time)) * (x$Toggle == "Off"))))
values ind
1 1 26R
2 21 36
3 0 80
4 0 95
5 12 99

One can use the as.difftime function to convert time from H:M:S format to seconds. Then, for each On status, find the lead record in order to calculate the interval of time elapsed since the On.
library(dplyr)
# Convert Time in seconds.
df %>% mutate(Time = as.difftime(Time, units = "secs")) %>%
  group_by(Cat) %>%
  mutate(TimeInterVal = ifelse(Toggle == "On", (lead(Time) - Time), 0)) %>%
  summarise(TimeInterVal = sum(TimeInterVal))
# # A tibble: 5 x 2
# Cat TimeInterVal
# <chr> <dbl>
# 1 26R 1.00
# 2 36 21.0
# 3 80 0
# 4 95 0
# 5 99 12.0
Note: One can consider arranging the data on Time to ensure rows are ordered chronologically.
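For the EDIT (the midnight rule), a hedged sketch along the same dplyr lines: prepend a synthetic "On" row at 00:00:00 for every Cat whose first Toggle is "Off", then run the same computation (this also arranges by Time, as the note above suggests):
library(dplyr)
# Cats whose first chronological Toggle is "Off" get an extra "On" row at
# midnight, so the time since 00:00:00 counts as on-time
df0 <- df %>%
  arrange(Time) %>%
  group_by(Cat) %>%
  slice(1) %>%
  filter(Toggle == "Off") %>%
  mutate(Time = "00:00:00", Toggle = "On") %>%
  ungroup()
bind_rows(df0, df) %>%
  mutate(Time = as.difftime(Time, units = "secs")) %>%
  arrange(Cat, Time) %>%
  group_by(Cat) %>%
  mutate(TimeInterVal = ifelse(Toggle == "On", lead(Time) - Time, 0)) %>%
  summarise(TimeInterVal = sum(TimeInterVal, na.rm = TRUE))
This reproduces the figures in the desired sample output (e.g. 18733 for Cat 26R).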
Data:
df <- read.table(text ="
Time Cat Toggle
1 05:12:09 36 On
2 05:12:12 26R Off
3 05:12:15 26R On
4 05:12:16 26R Off
5 05:12:18 99 Off
6 05:12:18 99 On
7 05:12:24 36 Off
8 05:12:26 36 On
9 05:12:29 80 Off
10 05:12:30 99 Off
11 05:12:31 95 Off
12 05:12:32 36 Off",
header = TRUE, stringsAsFactors = FALSE)

A possible solution using data.table:
# load the 'data.table'-package, convert 'df' to a 'data.table'
# and 'Time'-column to a time-format
library(data.table)
setDT(df)[, Time := as.ITime(Time)]
# calculate the time-difference
df[, .(time.diff = sum((shift(Time, type = 'lead') - Time) * (Toggle == 'On'), na.rm = TRUE))
, by = Cat]
which gives:
Cat time.diff
1: 36 21
2: 26R 1
3: 99 12
4: 80 0
5: 95 0
In response to your question in the comments, you could do:
# create a new data.table with midnight times for the categories where
# the first 'Toggle' is "Off"
df0 <- df[, .I[first(Toggle) == "Off"], by = Cat
][, .(Time = as.ITime("00:00:00"), Cat = unique(Cat), Toggle = "On")]
# bind that to the original data.table; order on 'Cat' and 'Time'
# and then do the same calculation
rbind(df, df0)[order(Cat, Time)
][, .(time.diff = sum((shift(Time, type = 'lead') - Time) * (Toggle == 'On'), na.rm = TRUE))
, by = Cat]
which gives:
Cat time.diff
1: 26R 18733
2: 36 21
3: 80 18749
4: 95 18751
5: 99 18750
An alternative with base R (only original question):
df$Time <- as.POSIXct(df$Time, format = "%H:%M:%S")
stack(sapply(split(df, df$Cat),
function(x) sum(diff(x[["Time"]]) * (head(x[["Toggle"]],-1) == 'On'))))
which gives:
values ind
1 1 26R
2 21 36
3 0 80
4 0 95
5 12 99
Or with the tidyverse (only original question):
library(dplyr)
library(lubridate)
df %>%
mutate(Time = lubridate::hms(Time)) %>%
group_by(Cat) %>%
summarise(time.diff = sum(diff(Time) * (head(Toggle, -1) == 'On'),
na.rm = TRUE))

Related

Calculate number of pending tasks at given time points (ideally with dplyr)

I have a database containing a list of events. Each event has an associated start date, and a date when the event ended or was completed, eg:
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)
> dataset
# A tibble: 25 x 3
eventid start_date completed_date
<int> <date> <date>
1 57 2011-01-14 2013-01-07
2 97 2011-01-21 2011-03-03
3 58 2011-01-26 2011-02-05
4 25 2011-03-22 2013-07-20
5 8 2011-04-20 2012-07-16
6 81 2011-04-26 2013-03-04
7 42 2011-05-02 2012-01-16
8 77 2011-05-03 2012-08-14
9 78 2011-05-21 2013-09-26
10 49 2011-05-22 2013-01-04
# ... with 15 more rows
>
I am trying to produce a rolling "snapshot" of how many tasks were pending at different points in time, e.g. month by month. Expected result:
# A tibble: 25 x 2
month count
<date> <int>
1 2011-01-01 0
2 2011-02-01 3
3 2011-03-01 2
4 2011-04-01 2
5 2011-05-01 4
6 2011-06-01 8
I have attempted to group my variables using group_by(period=floor_date(start_date,"month")), but I'm a bit stuck and would appreciate a pointer in the right direction!
I would prefer a solution using dplyr if possible.
Thanks!
You can expand rows for each month included in the range of dates with map2 from purrr. map2 will iterate over multiple inputs simultaneously. In this case, it will iterate through the start and end dates at the same time.
In each iteration, it will create a monthly sequence using seq (or seq.Date) from start to end month (determined from floor_date). The result is nested for each row of data (since one row can have multiple months in the sequence). So, unnest is needed afterwards.
The transmute will add a new variable called month_year (and drop the old ones) and use substr to extract the year and month only (no day). This is the first through seventh character of the date.
Then, you can group_by the month-year and count up the number of pending projects for each month_year.
I included set.seed so the data below is reproducible.
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
dataset %>%
mutate(month = map2(floor_date(start_date, "month"),
floor_date(completed_date, "month"),
seq.Date,
by = "month")) %>%
unnest(month) %>%
transmute(month_year = substr(month, 1, 7)) %>%
group_by(month_year) %>%
summarise(count = n())
Output
month_year count
<chr> <int>
1 2011-01 1
2 2011-02 3
3 2011-03 9
4 2011-04 10
5 2011-05 13
6 2011-06 15
7 2011-07 16
8 2011-08 18
9 2011-09 19
10 2011-10 20
# … with 22 more rows
If you want to exclude the completed month (except when start month and completed month are the same, if that can exist), you can end the sequence one month earlier (here, by subtracting a day from the floored completion date). In this case, pmax ensures that if both start and end months are the same, the month is still counted.
Here is the modified mutate with map2:
mutate(month = map2(floor_date(start_date, "month"),
pmax(floor_date(completed_date, "month") - 1, floor_date(start_date, "month")),
seq.Date,
by = "month"))
Data
set.seed(123)
dataset <- tibble(
eventid = sample(1:100, 25, replace=TRUE),
start_date = sample(seq(as.Date('2011/01/01'), as.Date('2012/01/01'), by="day"), 25),
completed_date = sample(seq(as.Date('2012/01/01'), as.Date('2014/01/01'), by="day"), 25)
)

R: Velocity/Aggregation - excess unique counts of column B per column A within certain time periods?

I'm exploring ways to identify when a count exceeds a certain threshold within a certain time period.
For example, let's say we have 4 columns - Transaction, Time, Email and CC. Throughout the data set, we want to identify WHICH user emails (Email) are involved with more than 2 credit cards (CC) within ANY 60 minute period. Ideally, we would also like to know at WHAT (Transaction) this threshold is broken.
The end goal is to know something like this -
'CBC' used its 3rd (CC) in <= 60 minutes at 'Transaction' 50.
Simulated data:
library(stringi)
set.seed(123)
CC <- sample(1000:1199, 100, replace = TRUE)
Email <- stri_rand_strings(100, 3, pattern = "[A-D]")
Time <- as.POSIXct("2020-01-01 00:00") + sort(sample(1:10000, 100))
DF <- data.frame(Time, Email, CC)
DF <- tibble::rowid_to_column(DF, "Transaction")
> head(DF)
Transaction Time Email CC
1 1 2020-01-01 00:00:05 CBB 1057
2 2 2020-01-01 00:04:40 DBD 1157
3 3 2020-01-01 00:08:11 DCB 1081
4 4 2020-01-01 00:09:39 ADB 1176
5 5 2020-01-01 00:11:39 ADC 1188
6 6 2020-01-01 00:13:45 ACD 1009
This seems to be a pretty unique question, as I'm essentially checking for excess/risky aggregation/counts throughout a data set.
An early dplyr attempt to set this up is as follows -
Counts_DF <- DF %>%
group_by(Email) %>%
mutate(HourInter = cut(Time, breaks = "60 min")) %>%
group_by(Email, HourInter) %>%
summarize(Diff_Cards = n_distinct(CC)) %>%
arrange(desc(Diff_Cards)) %>%
filter(Diff_Cards > 2)
> head(Counts_DF)
# A tibble: 5 x 3
# Groups: Email [5]
Email HourInter Diff_Cards
<fct> <chr> <int>
1 ABB 2020-01-01 01:22:00 3
2 BAC 2020-01-01 00:54:00 3
3 CAB 2020-01-01 00:35:00 3
4 CBC 2020-01-01 00:14:00 3
5 DAB 2020-01-01 01:41:00 3
However, I'm unsure what the 'HourInter' column is really doing and there is clearly no (Transaction) info available.
I've seen other questions for aggregations under static time intervals for just one column, but this is clearly a bit different. Any help with this would be greatly appreciated.
Here is a data.table approach:
library( data.table )
#make DF a data.table, set keys for optimised joining
setDT( DF, key = c("Email", "Time" ) )
#get CC used in hour window, and number of unique CC used last hour, by Email by row
DF[ DF,
#get desired values, suppress immediate output using {}
c( "cc_last_hour", "unique_cc_last_hour" ) := {
#temporary subset, with all DF values with the same Email, from the last hour
val = DF[ Email == i.Email &
Time %between% c( i.Time - lubridate::hours(1), i.Time) ]$CC
#get values
list( paste0( val, collapse = "-" ),
uniqueN( val ) )
},
#do the above for each row
by = .EACHI ]
#now subset rows where `unique_cc_last_hour` exceeds 2
DF[ unique_cc_last_hour > 2, ]
# Transaction Time Email CC cc_last_hour unique_cc_last_hour
# 1: 66 2020-01-01 01:35:32 AAD 1199 1152-1020-1199 3
# 2: 78 2020-01-01 02:00:16 AAD 1152 1152-1020-1199-1152 3
# 3: 53 2020-01-01 01:24:46 BAA 1096 1080-1140-1096 3
# 4: 87 2020-01-01 02:15:24 BAA 1029 1140-1096-1029 3
# 5: 90 2020-01-01 02:19:30 BAA 1120 1096-1029-1120 3
# 6: 33 2020-01-01 00:55:52 BBC 1031 1196-1169-1031 3
# 7: 64 2020-01-01 01:34:58 BDD 1093 1154-1052-1093 3
# 8: 68 2020-01-01 01:40:07 CBC 1085 1022-1052-1085 3
# 9: 38 2020-01-01 01:03:34 CCA 1073 1090-1142-1073 3
#10: 21 2020-01-01 00:35:54 DBB 1025 1194-1042-1025 3
#11: 91 2020-01-01 02:20:33 DDA 1109 1115-1024-1109 3
Update based on OP's comment below.
First, create some sample data with a transaction amount:
#sample data with an added Amount
library(stringi)
set.seed(123)
CC <- sample(1000:1199, 100, replace = TRUE)
Email <- stri_rand_strings(100, 3, pattern = "[A-D]")
Time <- as.POSIXct("2020-01-01 00:00") + sort(sample(1:10000, 100))
Amount <- sample( 50:100, 100, replace = TRUE )
DF <- data.frame(Time, Email, CC, Amount)
DF <- tibble::rowid_to_column(DF, "Transaction")
Here is the code that also calculates the sum of Amount for the past hour.
A bit more explanation of the functionality of the code:
- make DF a data.table
- 'loop' over each row of DF
- for each row, take the Email and Time of that row and...
- ...create a temporary subset of DF, where the Email is the same, and the Time is between Time - 1 hour and Time
- join on this subset, creating new columns "cc_hr", "un_cc_hr" and "am_hr", which get their values from a list: paste0( val$CC, collapse = "-" ) fills the first column (i.e. "cc_hr"), uniqueN( val$CC ) fills the second (i.e. "un_cc_hr"), and the sum of the amount ("am_hr") is calculated by sum( val$Amount )
As you can see, it does not calculate the score for every 60-minute interval; instead it defines the end of an interval by the Time of a Transaction, and then looks for Transactions with the same Email within the hour before that Time.
I assumed this is the behaviour you are looking for, and that you're not interested in periods where nothing happens.
library( data.table )
#make DF a data.table, set keys for optimised joining
setDT( DF, key = c("Email", "Time" ) )
#self join
DF[ DF,
#get desired values, suppress immediate output using {}
c( "cc_hr", "un_cc_hr", "am_hr" ) := {
#create a temporary subset of DF, named val,
# with all DF's rows with the same Email, from the last hour
val = DF[ Email == i.Email &
Time %between% c( i.Time - lubridate::hours(1), i.Time) ]
#get values
list( paste0( val$CC, collapse = "-" ),
uniqueN( val$CC ),
sum( val$Amount ) ) # <-- calculate the amount of all transactions
},
#do the above for each row of DF
by = .EACHI ]
sample output
#find all Transactions where, in the past hour,
# 1. the number of unique CC used > 2, OR
# 2. the total amount paid > 180
DF[ un_cc_hr > 2 | am_hr > 180, ]
# Transaction Time Email CC Amount cc_hr un_cc_hr am_hr
# 1: 80 2020-01-01 02:03:05 AAB 1021 94 1089-1021 2 194
# 2: 66 2020-01-01 01:35:32 AAD 1199 60 1152-1020-1199 3 209
# 3: 78 2020-01-01 02:00:16 AAD 1152 63 1152-1020-1199-1152 3 272
# 4: 27 2020-01-01 00:40:50 BAA 1080 100 1169-1080 2 186
# 5: 53 2020-01-01 01:24:46 BAA 1096 100 1080-1140-1096 3 259
# 6: 87 2020-01-01 02:15:24 BAA 1029 71 1140-1096-1029 3 230
# 7: 90 2020-01-01 02:19:30 BAA 1120 93 1096-1029-1120 3 264
# 8: 33 2020-01-01 00:55:52 BBC 1031 55 1196-1169-1031 3 171
# 9: 64 2020-01-01 01:34:58 BDD 1093 78 1154-1052-1093 3 212
# 10: 42 2020-01-01 01:08:04 CBC 1052 96 1022-1052 2 194
# 11: 68 2020-01-01 01:40:07 CBC 1085 100 1022-1052-1085 3 294
# 12: 38 2020-01-01 01:03:34 CCA 1073 81 1090-1142-1073 3 226
# 13: 98 2020-01-01 02:40:40 CCC 1121 86 1158-1121 2 183
# 14: 21 2020-01-01 00:35:54 DBB 1025 67 1194-1042-1025 3 212
# 15: 91 2020-01-01 02:20:33 DDA 1109 99 1115-1024-1109 3 236
You could always make the problem a bit easier by extracting the date and hour feature:
library(stringi)
library(tidyverse)
library(lubridate)
set.seed(123)
CC <- sample(1000:1199, 100, replace = TRUE)
Email <- stri_rand_strings(100, 3, pattern = "[A-D]")
Time <- as.POSIXct("2020-01-01 00:00") + sort(sample(1:10000, 100))
DF <- data.frame(Time, Email, CC)
DF <- tibble::rowid_to_column(DF, "Transaction")
DF %>%
mutate(Date = as.Date(Time),
Hour = hour(Time)) %>%
group_by(Date, Hour, Email) %>%
summarise(Diff_Cards = n_distinct(CC)) %>%
filter(Diff_Cards > 2) %>%
arrange(desc(Diff_Cards))
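Note that this bins transactions into fixed clock hours rather than any rolling 60-minute window, so two cards used at 00:55 and 01:05 land in different bins. A hedged dplyr sketch of a rolling check, mirroring the data.table self-join above (counting distinct CCs in the 60 minutes up to each transaction):
DF %>%
  group_by(Email) %>%
  arrange(Time, .by_group = TRUE) %>%
  mutate(cc_last_hour = sapply(seq_along(Time), function(i)
           n_distinct(CC[Time >= Time[i] - 3600 & Time <= Time[i]]))) %>%
  filter(cc_last_hour > 2)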

Adding Date and Time values in R

I have the following kind of data in my datafile
DriveNo Date and Time
12 2017-01-31 23:00:00 //Start time of a trip for Driver12
134 2017-01-31 23:00:01
12 2017-01-31 23:10:00 //End time ( 10 min trip)
345 (some date/time)
12 2017-01-31 23:20:00 //Start Time
12 2017-01-31 23:35:00 //End Time (15 min trip)
.
.
.
millions of similar data follow
The total number of rows is around 3 million. Now, I need to get the time driven by each of the drivers (there are around 500 drivers). My ideal output would be like
DriveNo TotalTimeDriven
12 35mins
134 ........(in days/hours/mins)
.
.
(for all other Drivers as well)
Above, DriveNo 12 has four entries, suggesting the start and end of two rides. Is there an efficient R way to do this?
A data.table solution (note that this takes the span from the first to the last timestamp per driver, so any idle time between separate trips is included):
# Sample data
df <- data.table(DriveNo = c(12, 134, 12, 134), Time = c("2017-01-31 23:00:00", "2017-01-31 23:00:01", "2017-01-31 23:10:00", "2017-01-31 23:20:01"))
df[, duration := max(as.POSIXct(Time)) - min(as.POSIXct(Time)), by = DriveNo]
df
DriveNo Time duration
1: 12 2017-01-31 23:00:00 10 mins
2: 134 2017-01-31 23:00:01 20 mins
3: 12 2017-01-31 23:10:00 10 mins
4: 134 2017-01-31 23:20:01 20 mins
range returns the minimum and maximum, and diff subtracts sequential numbers in a vector, so you could just do
aggregate(DateTime ~ DriveNo, df, function(x){diff(range(x))})
## DriveNo DateTime
## 1 12 10
## 2 134 0
or in dplyr,
library(dplyr)
df %>% group_by(DriveNo) %>% summarise(TimeDriven = diff(range(DateTime)))
## # A tibble: 2 × 2
## DriveNo TimeDriven
## <int> <time>
## 1 12 10 mins
## 2 134 0 mins
or in data.table,
library(data.table)
setDT(df)[, .(TimeDriven = diff(range(DateTime))), by = DriveNo]
## DriveNo TimeDriven
## 1: 12 10 mins
## 2: 134 0 mins
To change the units, it may be simpler to call difftime directly.
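For example, a sketch that reuses the dplyr pipeline from above with explicit units:
library(dplyr)
df %>%
  group_by(DriveNo) %>%
  summarise(TimeDriven = difftime(max(DateTime), min(DateTime), units = "secs"))
which reports 600 secs for DriveNo 12 and 0 secs for DriveNo 134 with the data below.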
Data
df <- structure(list(DriveNo = c(12L, 134L, 12L), DateTime = structure(c(1485921600,
1485921601, 1485922200), class = c("POSIXct", "POSIXt"), tzone = "")), class = "data.frame", row.names = c(NA,
-3L), .Names = c("DriveNo", "DateTime"))
For the edit, you can make a variable identifying starts and stops, reshape, and summarise with difftime and sum.
library(tidyverse)
set.seed(47)
drives <- data_frame(DriveNo = sample(rep(1:5, 4)),
DateTime = seq(as.POSIXct("2017-04-13 12:00:00"),
by = '10 min', length.out = 20))
drives %>% str()
#> Classes 'tbl_df', 'tbl' and 'data.frame': 20 obs. of 2 variables:
#> $ DriveNo : int 5 3 4 3 5 1 1 2 3 5 ...
#> $ DateTime: POSIXct, format: "2017-04-13 12:00:00" "2017-04-13 12:10:00" ...
elapsed <- drives %>%
group_by(DriveNo) %>%
mutate(event = rep(c('start', 'stop'), n() / 2),
i = cumsum(event == 'start')) %>%
spread(event, DateTime) %>%
summarise(TimeDriven = sum(difftime(stop, start, units = 'mins')))
elapsed
#> # A tibble: 5 × 2
#> DriveNo TimeDriven
#> <int> <time>
#> 1 1 60 mins
#> 2 2 110 mins
#> 3 3 120 mins
#> 4 4 130 mins
#> 5 5 80 mins
It would be faster to index by recycled Boolean vectors, but in dplyr they get unclassed at some point. In data.table,
library(data.table)
set.seed(47)
drives <- data.table(DriveNo = sample(rep(1:5, 4)),
DateTime = seq(as.POSIXct("2017-04-13 12:00:00"),
by = '10 min', length.out = 20))
elapsed <- drives[, .(TimeDriven = sum(difftime(DateTime[c(FALSE, TRUE)],
DateTime[c(TRUE, FALSE)],
units = 'mins'))),
keyby = DriveNo]
elapsed
#> DriveNo TimeDriven
#> 1: 1 60 mins
#> 2: 2 110 mins
#> 3: 3 120 mins
#> 4: 4 130 mins
#> 5: 5 80 mins
or in base,
set.seed(47)
drives <- data.frame(DriveNo = sample(rep(1:5, 4)),
DateTime = seq(as.POSIXct("2017-04-13 12:00:00"),
by = '10 min', length.out = 20))
elapsed <- aggregate(DateTime ~ DriveNo, drives,
function(x){sum(difftime(x[c(FALSE, TRUE)], x[c(TRUE, FALSE)], units = 'mins'))})
elapsed
#> DriveNo DateTime
#> 1 1 60
#> 2 2 110
#> 3 3 120
#> 4 4 130
#> 5 5 80
All forms will likely have issues if there are an odd number of times for a driver, which is not possible under the assumptions given. If it is, more cleaning is necessary.
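A quick check for that case (a sketch, using the data.table objects from above):
library(data.table)
# drivers with an odd number of timestamps cannot be paired into start/stop trips
drives[, .N, by = DriveNo][N %% 2 == 1]
An empty result means every driver has a clean set of start/stop pairs.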

Difftime for workdays according to holidayNYSE in R

I'm trying to find difftime for working days only; I want to calculate difftime according to the holidayNYSE calendar. When I use the difftime function, weekends and holidays are included in the answers. My dataset contains only data from working days, but when using difftime I have to subtract the non-working days somehow.
A is a vector of 0s and 1s, and I want to find the duration of each run of 0 or 1 in days. The duration for run one is supposed to be 35, but I get 49 (working days from January 1990).
df <- data.frame(Date = dates, A)
setDT(df)
DF1 <- df[, list(A = unique(A), duration = difftime(max(Date),min(Date), holidayNYSE
(year=setRmetricsOptions(start="1990-01-01", end="2015-31-12")))), by = run]
DF1
run A duration
1: 1 1 49 days
2: 2 0 22 days
3: 3 1 35 days
4: 4 0 27 days
5: 5 1 14 days
---
291: 291 1 6 days
292: 292 0 34 days
293: 293 1 10 days
294: 294 0 15 days
295: 295 1 29 days
An answer to my question without use of difftime:
df <- data.frame(Date = dates, Value1 = bull01)
setDT(df)
df[, run := cumsum(c(1, diff(Value1) != 0))]
# the duration of a run is simply its number of (working-day) rows
duration <- numeric(295)
for (i in 1:295) {
  ind <- which(df$run == i)
  duration[i] <- length(df$Date[ind])
}
type <- rep(c(1, 0), length.out = 295)
df2 <- data.frame(duration, type)
> df2
run duration type
1 35 1
2 17 0
3 25 1
4 20 0
5 10 1
---
291 5 1
292 25 0
293 9 1
294 11 0
295 21 1
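If you do want a difftime-style count that skips weekends and NYSE holidays, here is a hedged sketch built on timeDate::holidayNYSE() (trading_days is a hypothetical helper, not part of any package):
library(timeDate)
# count working days between two dates, excluding weekends and NYSE holidays
trading_days <- function(from, to) {
  days <- seq(as.Date(from), as.Date(to), by = "day")
  hol <- as.Date(holidayNYSE(as.numeric(unique(format(days, "%Y")))))
  sum(!(format(days, "%u") %in% c("6", "7")) & !(days %in% hol))
}
trading_days("1990-01-01", "1990-02-20")  # e.g., the span of run one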

How to subset data.frame by weeks and then sum?

Let's say I have several years worth of data which look like the following
# load date package and set random seed
library(lubridate)
set.seed(42)
# create data.frame of dates and income
date <- seq(dmy("26-12-2010"), dmy("15-01-2011"), by = "days")
df <- data.frame(date = date,
wday = wday(date),
wday.name = wday(date, label = TRUE, abbr = TRUE),
income = round(runif(21, 0, 100)),
week = format(date, format="%Y-%U"),
stringsAsFactors = FALSE)
# date wday wday.name income week
# 1 2010-12-26 1 Sun 91 2010-52
# 2 2010-12-27 2 Mon 94 2010-52
# 3 2010-12-28 3 Tues 29 2010-52
# 4 2010-12-29 4 Wed 83 2010-52
# 5 2010-12-30 5 Thurs 64 2010-52
# 6 2010-12-31 6 Fri 52 2010-52
# 7 2011-01-01 7 Sat 74 2011-00
# 8 2011-01-02 1 Sun 13 2011-01
# 9 2011-01-03 2 Mon 66 2011-01
# 10 2011-01-04 3 Tues 71 2011-01
# 11 2011-01-05 4 Wed 46 2011-01
# 12 2011-01-06 5 Thurs 72 2011-01
# 13 2011-01-07 6 Fri 93 2011-01
# 14 2011-01-08 7 Sat 26 2011-01
# 15 2011-01-09 1 Sun 46 2011-02
# 16 2011-01-10 2 Mon 94 2011-02
# 17 2011-01-11 3 Tues 98 2011-02
# 18 2011-01-12 4 Wed 12 2011-02
# 19 2011-01-13 5 Thurs 47 2011-02
# 20 2011-01-14 6 Fri 56 2011-02
# 21 2011-01-15 7 Sat 90 2011-02
I would like to sum 'income' for each week (Sunday thru Saturday). Currently I do the following:
Weekending 2011-01-01 = sum(df$income[1:7]) = 487
Weekending 2011-01-08 = sum(df$income[8:14]) = 387
Weekending 2011-01-15 = sum(df$income[15:21]) = 443
However I would like a more robust approach which will automatically sum by week. I can't work out how to automatically subset the data into weeks. Any help would be much appreciated.
First use format to convert your dates to week numbers, then plyr::ddply() to calculate the summaries:
library(plyr)
df$week <- format(df$date, format="%Y-%U")
ddply(df, .(week), summarize, income=sum(income))
week income
1 2011-52 413
2 2012-01 435
3 2012-02 379
For more information on format.Date, see ?strptime, particularly the bit that defines %U as the week number.
EDIT:
Given the modified data and requirement, one way is to divide the date by 7 to get a number indicating the week. (More precisely: as.numeric on a Date gives the number of days since the epoch, 1970-01-01 by default, so dividing by 7 gives the number of weeks since the epoch.)
In code:
df$week <- as.Date("1970-01-01") + 7*trunc(as.numeric(df$date)/7)
library(plyr)
ddply(df, .(week), summarize, income=sum(income))
week income
1 2010-12-23 298
2 2010-12-30 392
3 2011-01-06 294
4 2011-01-13 152
I have not checked that the week boundaries are on Sunday. You will have to check this, and insert an appropriate offset into the formula.
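For reference, 1970-01-01 was a Thursday, so the bins above run Thursday through Wednesday; shifting the formula by 3 days makes them run Sunday through Saturday (a sketch):
# 1970-01-04 was a Sunday; anchor the 7-day bins there instead
df$week <- as.Date("1970-01-01") + 3 + 7 * ((as.numeric(df$date) - 3) %/% 7)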
This is now simple using dplyr. Also I would suggest using cut(breaks = "week") rather than format() to cut the dates into weeks.
library(dplyr)
df %>% group_by(week = cut(date, "week")) %>% mutate(weekly_income = sum(income))
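One caveat: cut.Date with breaks = "week" starts weeks on Monday by default (see its start.on.monday argument), and mutate keeps one row per day. For Sunday-through-Saturday weeks with one row per week, a sketch:
df %>%
  group_by(week = cut(date + 1, "week")) %>%  # shift one day so bins run Sun-Sat
  summarise(weekly_income = sum(income))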
I Googled "group week days into weeks R" and came across this SO question. You mention you have multiple years, so I think we need to keep up with both the week number and also the year, so I modified the answers there to format(date, format = "%y%U").
In use it looks like this:
library(plyr) #for aggregating
df <- transform(df, weeknum = format(date, format = "%y%U"))
ddply(df, "weeknum", summarize, suminc = sum(income))
#----
weeknum suminc
1 1152 413
2 1201 435
3 1202 379
See ?strptime for all the format abbreviations.
Try rollapply from the zoo package:
rollapply(df$income, width=7, FUN = sum, by = 7)
# [1] 487 387 443
Or, use period.sum from the xts package:
period.sum(xts(df$income, order.by=df$date), which(df$wday %in% 7))
# [,1]
# 2011-01-01 487
# 2011-01-08 387
# 2011-01-15 443
Or, to get the output in the format you want:
data.frame(income = period.sum(xts(df$income, order.by=df$date),
which(df$wday %in% 7)),
week = df$week[which(df$wday %in% 7)])
# income week
# 2011-01-01 487 2011-00
# 2011-01-08 387 2011-01
# 2011-01-15 443 2011-02
Note that the first week shows as 2011-00 because that's how it is entered in your data. You could also use week = df$week[which(df$wday %in% 1)] which would match your output.
This solution is influenced by @Andrie and @Chase.
# load plyr
library(plyr)
# format weeks as per requirement (replace "00" with "52" and adjust corresponding year)
tmp <- list()
tmp$y <- format(df$date, format="%Y")
tmp$w <- format(df$date, format="%U")
tmp$y[tmp$w=="00"] <- as.character(as.numeric(tmp$y[tmp$w=="00"]) - 1)
tmp$w[tmp$w=="00"] <- "52"
df$week <- paste(tmp$y, tmp$w, sep = "-")
# get summary
df2 <- ddply(df, .(week), summarize, income=sum(income))
# include week ending date
tmp$week.ending <- lapply(df2$week, function(x) rev(df[df$week==x, "date"])[[1]])
df2$week.ending <- sapply(tmp$week.ending, as.character)
# week income week.ending
# 1 2010-52 487 2011-01-01
# 2 2011-01 387 2011-01-08
# 3 2011-02 443 2011-01-15
A pandas (Python) alternative: set the date column as a DatetimeIndex, then resample by week:
df = df.set_index('date')  # use the date column as the index
df.resample('W').sum()     # weekly sums
With dplyr:
df %>%
  arrange(date) %>%
  mutate(week = as.numeric(date - date[1]) %/% 7) %>%
  group_by(week) %>%
  summarise(weekincome = sum(income))
Instead of date[1] you can have any date from when you want to start your weekly study.
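For example, to anchor the weeks to the first Sunday in the data rather than to the first row (a sketch):
start <- as.Date("2010-12-26")  # a Sunday, so weeks run Sun-Sat
df %>%
  mutate(week = as.numeric(date - start) %/% 7) %>%
  group_by(week) %>%
  summarise(weekincome = sum(income))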
