ID FROM TO
1881 11/02/2013 11/02/2013
3090 09/09/2013 09/09/2013
1113 24/11/2014 06/12/2014
1110 24/07/2013 25/07/2013
111 25/06/2015 05/09/2015
I have a data.table of vacation dates, FROM and TO, and I want to know how many people were on vacation in any given month.
I tried:
dt[, .N, by=.(year(FROM), month(FROM))]
but obviously it would exclude people who were on vacation across two months, i.e. someone on vacation from JAN to FEB would only show up in the JAN count and not the FEB count, even though they are still on vacation in FEB.
The output of the above code showing year, month and number is exactly what I'm looking for otherwise.
year month N
1: 2013 2 17570
2: 2013 9 16924
3: 2014 11 18809
4: 2013 7 16984
5: 2015 6 14401
6: 2015 12 10239
7: 2014 3 19346
8: 2013 5 14864
EDIT: I want every month someone is away on vacation counted. So ID 111 would be counted in June, July, August and Sept.
EDIT 2:
Running uwe's code on the full dataset produces the Total Count column below.
Subsetting the full data set for people on vacation for <= 30 days and for > 30 days produces the counts in the respective columns below. These two columns added together should equal the Total count, so the DIFFERENCE should be 0, but this isn't the case.
month        Total count    <=30     >30   (<=30)+(>30)   DIFFERENCE
01/02/2012           899       4     895            899            0
01/03/2012          3966    2320    1646           3966            0
01/04/2012          8684    6637    2086           8723           39
01/05/2012         10287    7586    2750          10336           49
01/06/2012         12018    9080    3000          12080           62
The OP has not specified what the exact rules are for counting, for instance, how to count if the same ID has multiple non-overlapping periods of vacation in the same month.
The solution below is based on the following rules:
Each ID may appear in more than one row.
For each row, all months between FROM and TO are counted, including the FROM and TO months; e.g., ID 111 is counted in the months of June, July, August, and September 2015.
Vacations on the first or last day of a month count in full; e.g., a vacation starting on May 31 and ending on June 1 is counted in both months.
If an ID has multiple periods of vacation in one month, it is counted only once in that month. (This also explains the non-zero DIFFERENCE in EDIT 2: an ID with both a <= 30-day and a > 30-day vacation touching the same month is counted once in the total but once in each subset, so the two subset counts can add up to more than the total.)
To verify that the code implements these rules, I had to extend the sample dataset provided by the OP with additional use cases (see the Data section below).
library(data.table)
library(lubridate)
# coerce dt to data.table object and character dates to class Date
setDT(dt)[, (2:3) := lapply(.SD, dmy), .SDcols = 2:3]
# for each row, create sequence of first days of months
dt[, .(month = seq(floor_date(FROM, "months"), TO, by = "months")), by = .(ID, rowid(ID))][
# count the number of unique IDs per month, order result by month
, uniqueN(ID), keyby = month]
month V1
1: 2013-02-01 1
2: 2013-07-01 1
3: 2013-09-01 2
4: 2014-11-01 1
5: 2014-12-01 1
6: 2015-06-01 1
7: 2015-07-01 1
8: 2015-08-01 1
9: 2015-09-01 1
10: 2015-11-01 1
11: 2015-12-01 1
12: 2016-06-01 1
13: 2016-07-01 1
14: 2016-08-01 1
15: 2016-09-01 1
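If the year / month / N layout from the question is preferred over first-of-month dates, the count column can be split afterwards. A minimal sketch (storing the chained result in res and renaming V1 to N are not part of the original answer):
# same chain as above, stored so the Date column can be split into year and month numbers
res <- dt[, .(month = seq(floor_date(FROM, "months"), TO, by = "months")),
          by = .(ID, rowid(ID))][, uniqueN(ID), keyby = month]
res[, .(year = year(month), month = month(month), N = V1)]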
Data
Based on the OP's sample dataset but extended with additional use cases:
library(data.table)
dt <- fread(
"ID FROM TO
1881 11/02/2013 11/02/2013
1881 23/02/2013 24/02/2013
3090 09/09/2013 09/09/2013
3091 09/09/2013 09/09/2013
1113 24/11/2014 06/12/2014
1110 24/07/2013 25/07/2013
111 25/06/2015 05/09/2015
111 25/11/2015 05/12/2015
11 25/06/2016 01/09/2016"
)
For the data given in the question (here in a table called dat, with FROM and TO still stored as character), you would do the following. The sub("\\d+", "20", value) call replaces the day part of each date with 20, so that the monthly sequence from FROM to TO passes through every month the vacation touches:
melt(dat,1)[,value:=as.Date(sub("\\d+","20",value),"%d/%m/%Y")][,
seq(value[1],value[2],by="1 month"),by=ID][,.N,by=.(year(V1),month(V1))]
year month N
1: 2013 2 1
2: 2013 9 1
3: 2014 11 1
4: 2014 12 1
5: 2013 7 1
6: 2015 6 1
7: 2015 7 1
8: 2015 8 1
9: 2015 9 1
Related
I have a data table containing daily data. From this data table I want to extract weekly data points obtained each Wednesday. If Wednesday is a holiday, i.e. not available in the data table, the next available data point should be taken.
Here is an MWE:
library(data.table)
df <- data.table(date=as.Date(c("2012-06-25","2012-06-26","2012-06-27","2012-06-28","2012-06-29","2012-07-02","2012-07-03","2012-07-05","2012-07-06","2012-07-09","2012-07-10","2012-07-11","2012-07-12","2012-07-13","2012-07-16","2012-07-17","2012-07-18","2012-07-19","2012-07-20")))
df[,weekday:=strftime(date,'%u')]
with output:
date weekday
1: 2012-06-25 1
2: 2012-06-26 2
3: 2012-06-27 3
4: 2012-06-28 4
5: 2012-06-29 5
6: 2012-07-02 1
7: 2012-07-03 2
8: 2012-07-05 4 #here the 4th of July was skipped
9: 2012-07-06 5
10: 2012-07-09 1
11: 2012-07-10 2
12: 2012-07-11 3
13: 2012-07-12 4
14: 2012-07-13 5
15: 2012-07-16 1
16: 2012-07-17 2
17: 2012-07-18 3
18: 2012-07-19 4
19: 2012-07-20 5
My desired result in this case would be:
date weekday
2012-06-27 3
2012-07-05 4
2012-07-11 3
2012-07-18 3
Is there a more efficient way of obtaining this than going week by week via a for loop and checking whether the Wednesday data point is present in the data? I feel there must be a better way, so any advice would be highly appreciated!
Working solution (following Imo's suggestion):
df[,weekday:=wday(date)] #faster way to get weekdays, careful: numbers increased by 1 vs strftime
df[,numweek:=floor(as.numeric(date-date[1])/7+1)] #get continuous week numbers extending over end of years
df[df[,.I[which.min(abs(weekday-4.25))],by=.(numweek)]$V1] #gets result
Here is one method using a grouped data.table call that finds, for each week, the position (using .I) of the weekday value closest to 3, with which.min(abs(as.integer(weekday)-3.25)) breaking ties toward 4 rather than 2.
df[df[, .I[which.min(abs(as.integer(weekday)-3.25))], by=week(date)]$V1]
date weekday
1: 2012-06-27 3
2: 2012-07-05 4
3: 2012-07-11 3
4: 2012-07-18 3
Note that if your real data spans years, then you need to use by=.(week(date), year(date)).
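A minimal sketch of that multi-year variant (not needed for this single-year sample):
# same logic as above, but grouped by year and week so weeks do not collide across years
df[df[, .I[which.min(abs(as.integer(weekday) - 3.25))],
      by = .(yr = year(date), wk = week(date))]$V1]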
Note also that there is a data.table function wday that returns the integer day of the week directly. It counts from Sunday = 1, so Monday through Saturday come out 1 greater than the value returned by strftime with %u, and an adjustment is required if you want to use it directly (hence the 4.25 in the working solution above).
From your data.table with a single variable, you'd do
df[, weekday := wday(date)]
df[df[, .I[which.min(abs(weekday-4.25))], by=week(date)]$V1]
date weekday
1: 2012-06-27 4
2: 2012-07-05 5
3: 2012-07-11 4
4: 2012-07-18 4
Note that the dates match those above.
I have a daily revenue time series df from 01-01-2014 to 15-06-2017, and I want to aggregate the daily revenue data to weekly data and do weekly predictions. Before I aggregate the revenue, I need to create a continuous week variable which does NOT restart at week 1 when a new year begins. Since 01-01-2014 was not a Monday, I decided to start my first week from 06-01-2014.
My df now looks like this
date year month total
7 2014-01-06 2014 1 1857679.4
8 2014-01-07 2014 1 1735488.0
9 2014-01-08 2014 1 1477269.9
10 2014-01-09 2014 1 1329882.9
11 2014-01-10 2014 1 1195215.7
...
709 2017-06-14 2017 6 1677476.9
710 2017-06-15 2017 6 1533083.4
I want to create a unique week variable starting from 2014-01-06 until the last row of my dataset (1257 rows in total), which is 2017-06-15.
I wrote a loop:
week = c()
for (i in 1:179) {
week = rep(i,7)
print(week)
}
However, the result of this loop is not saved for each iteration. When I type week, it just shows 179,179,179,179,179,179,179
Where is the problem, and how can I append 180, 180, 180, 180 after the loop finishes?
And if I add more data after 2017-06-15, how can I create the week variable automatically based on the last date in the data? (In other words, I don't want to have to count the daily observations, divide by 7, and handle the leftover dates by hand to build the week index.)
Thank you!
Does this work?
library(lubridate)
#DATA
x = data.frame(date = seq.Date(from = ymd("2014-01-06"),
to = ymd("2017-06-15"), length.out = 15))
#Add year and week for each date
x$week = year(x$date) + week(x$date)/100
#Convert the addition of year and week to factor and then to numeric
x$week_variable = as.numeric(as.factor(x$week))
#Another alternative
x$week_variable2 = floor(as.numeric(x$date - min(x$date))/7) + 1
x
# date week week_variable week_variable2
#1 2014-01-06 2014.01 1 1
#2 2014-04-05 2014.14 2 13
#3 2014-07-04 2014.27 3 26
#4 2014-10-02 2014.40 4 39
#5 2014-12-30 2014.52 5 52
#6 2015-03-30 2015.13 6 65
#7 2015-06-28 2015.26 7 77
#8 2015-09-26 2015.39 8 90
#9 2015-12-24 2015.52 9 103
#10 2016-03-23 2016.12 10 116
#11 2016-06-21 2016.25 11 129
#12 2016-09-18 2016.38 12 141
#13 2016-12-17 2016.51 13 154
#14 2017-03-17 2017.11 14 167
#15 2017-06-15 2017.24 15 180
Here is the answer:
week = c()
for (i in 1:184) {
for (j in 1:7) {
week[j+(i-1)*7] = i
}
}
week = as.data.frame(week)
I created a week variable running from week 1 to week 184 (the end of my dataset). Each week number is repeated 7 times because there are 7 days in a week. Then I assigned the week variable to my data frame.
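For reference, a vectorized alternative that avoids hard-coding the number of weeks and adapts automatically when new rows are added (a sketch, not from the original answers; it assumes df has one row per day and is ordered by date):
# week index 1, 1, ..., 2, 2, ..., with the last partial week handled by length.out
df$week <- rep(seq_len(ceiling(nrow(df) / 7)), each = 7, length.out = nrow(df))
# or derive it from the dates themselves, so gaps in the data do not shift later weeks
df$week <- floor(as.numeric(df$date - min(df$date)) / 7) + 1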
Date Sales
3/11/2017 1
3/12/2017 0
3/13/2017 40
3/14/2017 47
3/15/2017 83
3/16/2017 62
3/17/2017 13
3/18/2017 58
3/19/2017 27
3/20/2017 17
3/21/2017 71
3/22/2017 76
3/23/2017 8
3/24/2017 13
3/25/2017 97
3/26/2017 58
3/27/2017 80
3/28/2017 77
3/29/2017 31
3/30/2017 78
3/31/2017 0
4/1/2017 40
4/2/2017 58
4/3/2017 32
4/4/2017 31
4/5/2017 90
4/6/2017 35
4/7/2017 88
4/8/2017 16
4/9/2017 72
4/10/2017 39
4/11/2017 8
4/12/2017 88
4/13/2017 93
4/14/2017 57
4/15/2017 23
4/16/2017 15
4/17/2017 6
4/18/2017 91
4/19/2017 87
4/20/2017 44
Here the current date is 20/04/2017. My question: how can I group the data from 11/03/2017 to 19/04/2017 into 4 equal parts and sum the sales within each part in R?
E.g.:
library("xts")
ep <- endpoints(data, on = 'days', k = 4)
period.apply(data,ep,sum)
It's not working as intended: it runs from the start date up to the current date, but I need to gather the data from yesterday (19/4/2017) back to the start date and split it into 4 equal parts.
Could anyone kindly guide me?
Thank you
Base R has a function cut.Date() which is built for the purpose.
However, the question is not fully clear on what the OP intends. My understanding of the requirements supplied in Q and additional comment is:
Take the sales data per day in Book1 but leave out the current day, i.e., use only completed days.
Group the data in four equal parts, i.e., four periods containing an equal number of days. (Note that the title of the Q and the attempt to use xts::endpoints() with k = 4 suggest that the OP might instead intend to group the data into periods of four days each; a sketch of that variant is given at the end of this answer.)
Summarize the sales figures by period.
For the sake of brevity, data.table is used here for data manipulation and aggregation, and lubridate for date handling:
library(data.table)
library(lubridate)
# coerce to data.table, convert Date column from character to class Date,
# exclude the actual date
temp <- setDT(Book1)[, Date := mdy(Book1$Date)][Date != today()]
# cut the date range in four parts
temp[, start_date_of_period := cut.Date(Date, 4)]
temp
# Date Sales start_date_of_period
# 1: 2017-03-11 1 2017-03-11
# 2: 2017-03-12 0 2017-03-11
# 3: 2017-03-13 40 2017-03-11
# ...
#38: 2017-04-17 6 2017-04-10
#39: 2017-04-18 91 2017-04-10
#40: 2017-04-19 87 2017-04-10
# Date Sales start_date_of_period
# aggregate sales by period
temp[, .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]
# start_date_of_period n_days total_sales
#1: 2017-03-11 10 348
#2: 2017-03-21 10 589
#3: 2017-03-31 10 462
#4: 2017-04-10 10 507
Thanks to chaining, this can be put together in one statement without using a temporary variable:
setDT(Book1)[, Date := mdy(Book1$Date)][Date != today()][
, start_date_of_period := cut.Date(Date, 4)][
, .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]
Note: to reproduce the result in the future, you will have to replace the call to today(), which excludes the current day, with mdy("4/20/2017"), the last day in the sample data set supplied by the OP.
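If the OP instead meant periods of four days each (the reading suggested by the attempt with xts::endpoints() and k = 4), cut.Date() also accepts an interval specification. A minimal sketch of that variant (the column name four_day_period is chosen here for illustration):
temp[, four_day_period := cut.Date(Date, breaks = "4 days")][
  , .(n_days = .N, total_sales = sum(Sales)), by = four_day_period]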
I have a vector of dates like this:
1 2014-03-10 22:54:24
2 2014-03-10 22:53:16
3 2014-03-10 22:53:01
4 2014-03-10 22:52:38
5 2014-03-10 22:52:00
6 2014-03-01 01:13:08
7 2014-03-01 01:11:30
8 2014-03-01 01:07:41
9 2014-03-01 01:05:28
10 2014-03-01 00:58:40
11 2014-03-27 18:11:57
How can I group by month, day, morning, afternoon or week? For instance:
month sum
2014-3 11
==============
week sum
2014-3-1 5
2014-3-9 5
==============
2014-3-1
morning sum
2014-3-1 5
Use the data.table package and get to know the POSIXlt class.
# x is assumed to be your vector of time objects (POSIXct/POSIXlt).
# The following lines are just for getting to know POSIXlt. You do not need to run them.
Secs <- as.POSIXlt(x)[[1]]
Mins <- as.POSIXlt(x)[[2]]
# ...
Month <- as.POSIXlt(x)[[5]] + 1 # months start at 0, not 1
Year <- as.POSIXlt(x)[[6]] - 100 # [[6]] is years since 1900 (116 for 2016), so this gives the 2-digit year
DayOfYear <- as.POSIXlt(x)[[8]] + 1 # [[8]] (yday) starts at 0
You can calculate more complicated values similarly. Now use data.table:
require(data.table)
X <- as.data.table(x) # creates a data.table object
setnames(X, "Time") # names the 1 column 'Time'
X[ , month := as.POSIXlt(Time)[[5]] + 1] #adds a column month
X[ , doy:= as.POSIXlt(Time)[[8]] + 1] #adds a column day of year
#....
Now you can group your data.table with:
X[ , .N, by = doy]
X[ , .N, by = month]
# ...
.N returns the number of items in each group. You could also combine the grouping:
X[ , .N, by = list(doy, month)]
There are many nice data.table tutorials, and its grouping and evaluation syntax is similar to SQL (which the tutorials also cover).
A good place to start is the developers' FAQ:
http://datatable.r-forge.r-project.org/datatable-faq.pdf
EDIT:
Of course you could also make more complicated columns for afternoon and morning like this:
X[ , afternoon := as.POSIXlt(Time)[[3]] > 12] # [[3]] is the hour of the day (0-23)
Assuming you have a data frame like this where time is in POSIXct format:
df
time
1 2014-03-10 22:54:24
2 2014-03-10 22:53:16
3 2014-03-10 22:53:01
4 2014-03-10 22:52:38
5 2014-03-10 22:52:00
6 2014-03-01 01:13:08
7 2014-03-01 01:11:30
8 2014-03-01 01:07:41
9 2014-03-01 01:05:28
10 2014-03-01 00:58:40
11 2014-03-27 18:11:57
You can get month, week and am/pm as follows:
df$month <- format(df$time, '%Y-%m')
df$week <- format(df$time, '%Y-%U')
df$ampm <- ifelse(as.numeric(format(df$time, '%H')) > 12, 'pm', 'am')
df
time month week ampm
1 2014-03-10 22:54:24 2014-03 2014-10 pm
2 2014-03-10 22:53:16 2014-03 2014-10 pm
3 2014-03-10 22:53:01 2014-03 2014-10 pm
4 2014-03-10 22:52:38 2014-03 2014-10 pm
5 2014-03-10 22:52:00 2014-03 2014-10 pm
6 2014-03-01 01:13:08 2014-03 2014-08 am
7 2014-03-01 01:11:30 2014-03 2014-08 am
8 2014-03-01 01:07:41 2014-03 2014-08 am
9 2014-03-01 01:05:28 2014-03 2014-08 am
10 2014-03-01 00:58:40 2014-03 2014-08 am
11 2014-03-27 18:11:57 2014-03 2014-12 pm
Then, you can get your summaries using library dplyr like this:
library(dplyr)
count(df, month)
Source: local data frame [1 x 2]
month n
(chr) (int)
1 2014-03 11
count(df, week)
Source: local data frame [3 x 2]
week n
(chr) (int)
1 2014-08 5
2 2014-10 5
3 2014-12 1
count(df, ampm)
Source: local data frame [2 x 2]
ampm n
(chr) (int)
1 am 5
2 pm 6
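And if the per-day morning/afternoon breakdown from the question (e.g. the morning of 2014-3-1) is also wanted, the same columns can be combined; a sketch, where day is a helper column added here for illustration:
df$day <- format(df$time, '%Y-%m-%d')  # day is not part of the original answer
count(df, day, ampm)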
I cleared one hurdle with some help from SO, and thought the next hurdle would be easier. What I really have is start and end dates in a data frame:
require(lubridate)
demo <- read.table(text = "
start end num
2010-12-31 <NA> 35
2013-04-01 <NA> 34
2015-06-02 <NA> 34
2015-06-15 2012-12-31 34
2015-01-30 2011-12-31 33
2014-04-15 2013-12-31 33
2014-05-28 2013-12-31 33
2014-06-02 <NA> 33
2015-06-17 <NA> 33
2015-06-25 <NA> 33
2015-06-24 <NA> 32
2013-07-31 <NA> 32
2013-08-31 <NA> 32
2015-04-27 <NA> 31
2015-05-07 <NA> 31
2013-12-30 <NA> 31
2014-11-21 <NA> 30
2013-12-20 2013-06-30 30
",header = TRUE, sep = "")
demo$start <- as.Date(demo$start, '%Y-%m-%d')
demo$end <- as.Date(demo$end, '%Y-%m-%d')
I can get a table of start years, or a table of end years, with table(year(demo$start)) or table(year(demo$end)), which is a lovely start. But what I really want to know is something more like: for each start year, how many entries have not yet ended? So, count is.na(end) for each start year.
I thought I could use aggregate() for that:
aggregate(is.na(end) ~ year(start), demo, FUN = length)
But that seems to count every observation, not just the observations for which the end date is NA.
You can use table with multiple arguments to give you 2-way or multi-way tables:
> with(demo, table( year=format(demo$start, "%Y"), Not.missing = !is.na(end) ) )
Not.missing
year FALSE TRUE
2010 1 0
2013 4 1
2014 2 2
2015 6 2
You could also use lubridate::year instead of the format call.
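For instance, the same call with lubridate::year might look like this (a sketch; lubridate is already loaded in the question):
with(demo, table(year = year(start), Not.missing = !is.na(end)))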
If we need to find the number of NA values for each year, we can use sum, since is.na(end) is a logical vector. length gives the total number of observations per year rather than the number of TRUE values.
aggregate(cbind(end=is.na(end)) ~ cbind(year=year(start)), demo, FUN = sum)
# year end
#1 2010 1
#2 2013 4
#3 2014 2
#4 2015 6
Or we can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(demo)), use is.na(end) in i as the row index, group by the year of the 'start' column, and get .N, the number of rows in each group.
library(data.table)
setDT(demo)[is.na(end), list(end = .N) , list(year=year(start))]
# year end
#1: 2010 1
#2: 2013 4
#3: 2015 6
#4: 2014 2
Here is another option:
library(dplyr)
library(lubridate)
demo %>% subset(is.na(end)) %>% group_by(year(start)) %>% summarise(n=length(end))
#Source: local data frame [4 x 2]
#
# year(start) n
#1 2010 1
#2 2013 4
#3 2014 2
#4 2015 6
This is pretty straightforward. With your original data (demo), subset to keep only the rows where end is NA. Then, using year() from the lubridate package, group by year and summarise the number of NA rows in the end column per group. This returns a data frame.