There are lots of questions about daylight saving conversion and POSIXct/POSIXlt, date-times, etc., but none that I have found appear to address the approach I need to take to daylight saving time.
I am interested in analyzing daily load curves for energy use, and approaches which just cut the spring hour out of the dataset do not work for me. I need an approach that shifts all measurements to the subsequent hour after spring daylight savings and to the prior hour after the fall adjustment. See below for a clear example.
library(data.table)
EnergyUse <- data.table("Date" = c("1997-04-06", "1997-04-06", "1997-04-06", "1997-04-06"),
                        "Hour" = 1:4,
                        "Power" = c(30, 40, 60, 80))
print(EnergyUse)
# Date Hour Power
#1: 1997-04-06 1 30
#2: 1997-04-06 2 40 #when daylight savings kicked in for 1997
#3: 1997-04-06 3 60
#4: 1997-04-06 4 80
The "Hour" field ranges from 0 to 23 for every day of the year, i.e. "local standard time". It happens to be Pacific Time, as you will see below, but I would have the same question for any time zone that implemented daylight savings.
Now I need to merge the date and hour fields into a single Date_Time field that is formatted as a date-time and incorporates daylight saving, as I am interested in the hourly power patterns (i.e. load curves), which shift based both on relative time (e.g. when people go to/get off work) and on absolute time (e.g. when it gets cold/hot or when the sun sets).
EnergyUseAdj <- EnergyUse[, Date_Time := as.POSIXct(paste(Date, Hour), format="%Y-%m-%d %H", tz="America/Los_Angeles")]
which results in:
print(EnergyUseAdj)
# Date Hour Power Date_Time
#1: 1997-04-06 1 30 1997-04-06 01:00:00
#2: 1997-04-06 2 40 <NA>
#3: 1997-04-06 3 60 1997-04-06 03:00:00
#4: 1997-04-06 4 80 1997-04-06 04:00:00
This, however, makes the "Power" data for the new daylight savings 3am and 4am incorrect. The actual power production figure for the daylight adjusted 3am would instead be that which was listed for 2am standard time (i.e. 40), and that for 4am would then be 60.
The correct way to adjust for this, albeit likely more computationally expensive for large datasets, would be to adjust the entire time-series by a positive offset of 1 hour in spring and a negative offset of 1 hour in fall, like the below:
# Date Hour Power Date_Time
#1: 1997-04-06 1 30 1997-04-06 01:00:00
#2: 1997-04-06 2 <NA> <NA>
#3: 1997-04-06 3 40 1997-04-06 03:00:00
#4: 1997-04-06 4 60 1997-04-06 04:00:00
Or, perhaps smoother for use in other algorithms due to lack of NA lines, like the below:
# Date Hour Power Date_Time
#1: 1997-04-06 1 30 1997-04-06 01:00:00
#2: 1997-04-06 3 40 1997-04-06 03:00:00
#3: 1997-04-06 4 60 1997-04-06 04:00:00
#4: 1997-04-06 5 80 1997-04-06 05:00:00
After toying around with POSIXct and reading through a bunch of similar questions on this adjustment, I could not find a great solution. Any ideas?
EDIT: Per GregorThomas' request, see below for a larger sample of data in case you wish to use two days' worth.
# OP_DATE OP_HOUR Power
# 1: 1997-04-05 0 71
# 2: 1997-04-05 1 61
# 3: 1997-04-05 2 54
# 4: 1997-04-05 3 57
# 5: 1997-04-05 4 68
# 6: 1997-04-05 5 76
# 7: 1997-04-05 6 89
# 8: 1997-04-05 7 106
# 9: 1997-04-05 8 148
#10: 1997-04-05 9 154
#11: 1997-04-05 10 143
#12: 1997-04-05 11 123
#13: 1997-04-05 12 105
#14: 1997-04-05 13 94
#15: 1997-04-05 14 85
#16: 1997-04-05 15 86
#17: 1997-04-05 16 84
#18: 1997-04-05 17 83
#19: 1997-04-05 18 99
#20: 1997-04-05 19 105
#21: 1997-04-05 20 94
#22: 1997-04-05 21 95
#23: 1997-04-05 22 81
#24: 1997-04-05 23 66
#25: 1997-04-06 0 75
#26: 1997-04-06 1 70
#27: 1997-04-06 2 62
#28: 1997-04-06 3 56
#29: 1997-04-06 4 55
#30: 1997-04-06 5 57
#31: 1997-04-06 6 51
#32: 1997-04-06 7 57
#33: 1997-04-06 8 59
#34: 1997-04-06 9 61
#35: 1997-04-06 10 64
#36: 1997-04-06 11 63
#37: 1997-04-06 12 63
#38: 1997-04-06 13 63
#39: 1997-04-06 14 60
#40: 1997-04-06 15 68
#41: 1997-04-06 16 69
#42: 1997-04-06 17 69
#43: 1997-04-06 18 91
#44: 1997-04-06 19 120
#45: 1997-04-06 20 100
#46: 1997-04-06 21 74
#47: 1997-04-06 22 56
#48: 1997-04-06 23 55
If your data is reliably every hour, you can calculate a sequence of hours of the appropriate length. The implementation of POSIX datetimes accounts for daylight savings time, leap years, etc.
Simplifying the method in my comment, I'd recommend calculating the sequence based on the start time and the length.
EnergyUseAdj <- EnergyUse[,
Date_Time := seq(
from = as.POSIXct(paste(Date[1], Hour[1]), format="%Y-%m-%d %H", tz="America/Los_Angeles"),
length.out = .N,
by = "1 hour"
)]
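For the four-row example from the question, this should give the following (a sketch of the expected result: because by = "1 hour" steps in fixed 3600-second increments, the nonexistent 02:00 local hour is simply skipped on 1997-04-06 in America/Los_Angeles):
print(EnergyUseAdj)
# Date Hour Power Date_Time
#1: 1997-04-06 1 30 1997-04-06 01:00:00
#2: 1997-04-06 2 40 1997-04-06 03:00:00
#3: 1997-04-06 3 60 1997-04-06 04:00:00
#4: 1997-04-06 4 80 1997-04-06 05:00:00
Each Power value keeps its place in the series, so the 40 recorded at 2am standard time lands on the daylight-adjusted 03:00, which is the pairing the question asks for.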
I have the arrival time and departure time (and date) of different customers of a system. I want to count the number of people in the system every 30 minutes. How can I do this in R?
Here are my data
If I understand your question, here's an example with fake data:
library(tidyverse)
library(lubridate)
# Fake data
set.seed(2)
dat = data.frame(id=1:1000, type=rep(c("A","B"), 500),
arrival=as.POSIXct("2013-08-21 05:00:00") + sample(-10000:10000, 1000, replace=TRUE))
dat$departure = dat$arrival + sample(100:5000, 1000, replace=TRUE)
# Times when we want to check how many people are still present
times = seq(round_date(min(dat$arrival), "hour"), ceiling_date(max(dat$departure), "hour"), "30 min")
# Count number of people present at each time
map_df(times, function(x) {
dat %>%
group_by(type) %>%
summarise(Time = x,
Count=sum(arrival < x & departure > x)) %>%
spread(type, Count) %>%
mutate(Total = A + B)
})
Time A B Total
<dttm> <int> <int> <int>
1 2013-08-21 02:00:00 0 0 0
2 2013-08-21 02:30:00 26 31 57
3 2013-08-21 03:00:00 54 53 107
4 2013-08-21 03:30:00 75 81 156
5 2013-08-21 04:00:00 58 63 121
6 2013-08-21 04:30:00 66 58 124
7 2013-08-21 05:00:00 55 60 115
8 2013-08-21 05:30:00 52 63 115
9 2013-08-21 06:00:00 57 62 119
10 2013-08-21 06:30:00 62 51 113
11 2013-08-21 07:00:00 60 67 127
12 2013-08-21 07:30:00 72 54 126
13 2013-08-21 08:00:00 66 46 112
14 2013-08-21 08:30:00 19 12 31
15 2013-08-21 09:00:00 1 2 3
16 2013-08-21 09:30:00 0 0 0
17 2013-08-21 10:00:00 0 0 0
I'm not sure what you mean by counting the number of people "in the system", but I'm assuming you mean "the number of people who have arrived but not yet departed". To do this, you can apply a simple logical condition on the relevant columns of your dataframe, e.g.
logicVec <- df$arrival_time <= dateTimeObj & dateTimeObj < df$departure_time
logicVec will then be a logical vector of TRUEs and FALSEs. Because TRUE == 1 and FALSE == 0, you can simply use sum(logicVec) to get the total number of people/customers/rows that fulfill the condition above.
You can then simply repeat this line of code for every dateTimeObj (of class POSIXct, for example) that you want; in your case, that would be a dateTimeObj every 30 minutes.
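For example (a minimal base-R sketch; the arrival_time/departure_time column names are taken from the condition above, and the start, end, and 30-minute spacing are only illustrative):
times <- seq(from = as.POSIXct("2013-08-21 02:00:00"),
             to   = as.POSIXct("2013-08-21 10:00:00"),
             by   = "30 min")
# apply the same logical condition at each time point and sum the TRUEs
counts <- sapply(times, function(t) sum(df$arrival_time <= t & t < df$departure_time))
data.frame(Time = times, Count = counts)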
I hope this helps.
I guess I don't even know really what to 'title' this question as.
But I think this is quite a common data manipulation requirement.
I have data that has a periodic exchange between two parties of a quantity of a good. The exchanges are made hourly. Here is an example data frame:
df <- cbind.data.frame(Seller = as.character(c("A","A","A","A","A","A")),
Buyer = c("B","B","B","C","C","C"),
DateTimeFrom = c("1/07/2013 0:00","1/07/2013 9:00","1/07/2013 0:00","1/07/2013 6:00","1/07/2013 8:00","2/07/2013 9:00"),
DateTimeTo = c("1/07/2013 8:00","1/07/2013 15:00","2/07/2013 8:00","1/07/2013 9:00","1/07/2013 12:00","2/07/2013 16:00"),
Qty = c(50,10,20,25,5,5)
)
df$DateTimeFrom <- as.POSIXct(df$DateTimeFrom, format = '%d/%m/%Y %H:%M', tz = 'GMT')
df$DateTimeTo <- as.POSIXct(df$DateTimeTo, format = '%d/%m/%Y %H:%M', tz = 'GMT')
> df
Seller Buyer DateTimeFrom DateTimeTo Qty
1 A B 2013-07-01 00:00:00 2013-07-01 08:00:00 50
2 A B 2013-07-01 09:00:00 2013-07-01 15:00:00 10
3 A B 2013-07-01 00:00:00 2013-07-02 08:00:00 20
4 A C 2013-07-01 06:00:00 2013-07-01 09:00:00 25
5 A C 2013-07-01 08:00:00 2013-07-01 12:00:00 5
6 A C 2013-07-02 09:00:00 2013-07-02 16:00:00 5
So, for example, the first row of this data frame says that the Seller "A" sells 50 units of the good to the buyer "B" every hour from midnight on 1/7/13 until 8am on 1/7/13. You can also notice that some of these exchanges between the same two parties can overlap, but just with a different negotiated quantity.
What I need to do (and need your help with) is to generate a sequence covering all hours over this two-day period that sums the total quantity exchanged in each hour between the two parties, over all negotiations.
Here would be the resulting dataframe.
DateTimeSeq <- data.frame(seq(ISOdate(2013,7,1,0),by = "hour", length.out = 48))
colnames(DateTimeSeq) <- c("DateTime")
#What the Answer should be
DateTimeSeq$QtyAB <- c(70,70,70,70,70,70,70,70,70,30,30,30,30,30,30,30,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
DateTimeSeq$QtyAC <- c(0,0,0,0,0,0,25,25,30,30,5,5,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,5,5,5,5,5,5,5,0,0,0,0,0,0,0)
> DateTimeSeq
DateTime QtyAB QtyAC
1 2013-07-01 00:00:00 70 0
2 2013-07-01 01:00:00 70 0
3 2013-07-01 02:00:00 70 0
4 2013-07-01 03:00:00 70 0
5 2013-07-01 04:00:00 70 0
6 2013-07-01 05:00:00 70 0
7 2013-07-01 06:00:00 70 25
8 2013-07-01 07:00:00 70 25
9 2013-07-01 08:00:00 70 30
10 2013-07-01 09:00:00 30 30
11 2013-07-01 10:00:00 30 5
12 2013-07-01 11:00:00 30 5
13 2013-07-01 12:00:00 30 5
14 2013-07-01 13:00:00 30 0
15 2013-07-01 14:00:00 30 0
.... etc
Anybody able to lend a hand?
Thanks,
A
Here is my solution, which uses the dplyr and reshape packages.
library(dplyr)
library(reshape)
First, we expand the data frame so that everything is in an hourly format. This can be done using dplyr's do().
df %>% rowwise() %>%
do(data.frame(Seller=.$Seller,
Buyer=.$Buyer,
Qty=.$Qty,
DateTimeCurr=seq(from=.$DateTimeFrom, to=.$DateTimeTo, by="hour")))
Output:
Source: local data frame [66 x 4]
Groups: <by row>
Seller Buyer Qty DateTimeCurr
1 A B 50 2013-07-01 00:00:00
2 A B 50 2013-07-01 01:00:00
3 A B 50 2013-07-01 02:00:00
...
From there it is trivial to build the correct ids and summarise the totals using group_by.
df1 <- df %>% rowwise() %>%
do(data.frame(Seller=.$Seller,
Buyer=.$Buyer,
Qty=.$Qty,
DateTimeCurr=seq(from=.$DateTimeFrom, to=.$DateTimeTo, by="hour"))) %>%
group_by(Seller, Buyer, DateTimeCurr) %>%
summarise(TotalQty=sum(Qty)) %>%
mutate(id=paste0("Qty", Seller, Buyer))
Output:
Source: local data frame [48 x 5]
Groups: Seller, Buyer
Seller Buyer DateTimeCurr TotalQty id
1 A B 2013-07-01 00:00:00 70 QtyAB
2 A B 2013-07-01 01:00:00 70 QtyAB
3 A B 2013-07-01 02:00:00 70 QtyAB
From this dataframe, all we have to do is cast it into the format you have above.
> cast(df1, DateTimeCurr~ id, value="TotalQty")
DateTimeCurr QtyAB QtyAC
1 2013-07-01 00:00:00 70 NA
2 2013-07-01 01:00:00 70 NA
3 2013-07-01 02:00:00 70 NA
4 2013-07-01 03:00:00 70 NA
5 2013-07-01 04:00:00 70 NA
6 2013-07-01 05:00:00 70 NA
So the whole piece of code is:
df1 <- df %>% rowwise() %>%
do(data.frame(Seller=.$Seller,
Buyer=.$Buyer,
Qty=.$Qty,
DateTimeCurr=seq(from=.$DateTimeFrom, to=.$DateTimeTo, by="hour"))) %>%
group_by(Seller, Buyer, DateTimeCurr) %>%
summarise(TotalQty=sum(Qty)) %>%
mutate(id=paste0("Qty", Seller, Buyer))
cast(df1, DateTimeCurr~ id, value="TotalQty")
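One small follow-up to match the question's expected output exactly: hours in which nothing was exchanged show up as NA, or are missing as rows altogether. A possible fix, sketched on top of the question's DateTimeSeq and not part of the original answer, is to left-join onto the full 48-hour sequence and then replace the remaining NAs with zeros:
wide <- cast(df1, DateTimeCurr ~ id, value = "TotalQty")
# every hour in the two-day window becomes a row, even if nothing was traded then
full <- merge(DateTimeSeq, wide, by.x = "DateTime", by.y = "DateTimeCurr", all.x = TRUE)
full[is.na(full)] <- 0   # hours with no exchange become 0 instead of NA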
I am using a Kaggle data set for bike sharing. I would like to write a script that compares my predicted values to the training data set, comparing the mean by month for each year.
The training data set, which I call df, looks like this:
datetime count
1 2011-01-01 00:00:00 16
2 2011-01-11 01:00:00 40
3 2011-02-01 02:00:00 32
4 2011-02-11 03:00:00 13
5 2011-03-21 04:00:00 1
6 2011-03-11 05:00:00 1
My predicted values, which I call sub, look like this:
datetime count
1 2011-01-01 00:00:00 42
2 2011-01-11 01:00:00 33
3 2011-02-01 02:00:00 33
4 2011-02-11 05:00:00 36
5 2011-03-21 06:00:00 57
6 2011-03-11 07:00:00 129
I have isolated the month and year using the lubridate package, then concatenated them as a new month-year column. I split on the new column and used lapply to find the mean.
library(lubridate)
df$monyear <- interaction(
month(ymd_hms(df$datetime)),
year(ymd_hms(df$datetime)),
sep="-")
s<-split(df,df$monyear)
x <-lapply(s,function(x) colMeans(x[,c("count", "count")],na.rm=TRUE))
But this gives me the average for each month-year combination nested in a list, so it is not easy to compare. What I would like instead is:
year-month train-mean sub-mean diff
1 2011-01 28 37.5 9.5
2 2011-02 22.5 34.5 12
3 2011-03 1 93 92
Is there a better way to do this?
Something like this, run once per data set (here the training and prediction sets are called dftrain and dftest, each with the monyear column added as above):
library(dplyr)
dftrain %>% group_by(monyear) %>% summarize(mc=mean(count)) -> xtrain
dftest %>% group_by(monyear) %>% summarize(mc=mean(count)) -> xtest
merged <- merge(xtrain, xtest, by="monyear")
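To add the diff column from the expected output (a small sketch on top of the merge above; merge's default suffixes name the two mean columns mc.x for the training set and mc.y for the predictions):
merged$diff <- merged$mc.y - merged$mc.x   # predicted mean minus training mean
merged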