Compare two time series - r

I have two time series with hourly resolution. I want to compare the load time series with the capacity time series and count the number of hours in which the load is bigger than the capacity, so that I know for each hour whether there is enough capacity to meet the load. I also want to calculate the exact difference in the cases where there is not enough capacity.
library(xts)
load<-c(81,81,82,98,81,67,90,92,75,78,83,83,83,43,97,92,72,85,62)
capacity<-c(78,97,78,65,45,98,67,109,78,109,52,42,97,87,83,90,99,89,125)
time1<-seq(from=as.POSIXct("2013-01-01 00:00"),to=as.POSIXct("2013-01-01 18:00"),by="hour")
dat0<-data.frame(load,capacity)
df1<-xts(dat0,order.by=time1)
df1
load capacity
2013-01-01 00:00:00 81 78
2013-01-01 01:00:00 81 97
2013-01-01 02:00:00 82 78
2013-01-01 03:00:00 98 65
2013-01-01 04:00:00 81 45
2013-01-01 05:00:00 67 98
2013-01-01 06:00:00 90 67
2013-01-01 07:00:00 92 109
2013-01-01 08:00:00 75 78
2013-01-01 09:00:00 78 109
2013-01-01 10:00:00 83 52
2013-01-01 11:00:00 83 42
2013-01-01 12:00:00 83 97
2013-01-01 13:00:00 43 87
2013-01-01 14:00:00 97 83
2013-01-01 15:00:00 92 90
2013-01-01 16:00:00 72 99
2013-01-01 17:00:00 85 89
2013-01-01 18:00:00 62 125
I just want to know what is the fastest way to calculate it. I need to compare 10 years of data.

I would suggest using dplyr, which runs considerably faster on large datasets. Check out the following piece of code, and also make sure to have a look at the official Introduction to dplyr.
library(dplyr)
## difference between capacity and load
dat0 %>%
  mutate(diff = capacity - load) -> dat1
## count hours with sufficient capacity
dat1 %>%
  count(sufficient = diff >= 0) %>%
  data.frame()
And here's the console output of the second operation.
  sufficient  n
1      FALSE  9
2       TRUE 10
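Since speed is the concern: the comparison is fully vectorised, so it also works directly on the xts object without any extra packages. A minimal base-R sketch, reusing the df1 object built above:
## positive values mean there is not enough capacity in that hour
shortfall <- df1$load - df1$capacity
sum(shortfall > 0)        # number of hours where load exceeds capacity
shortfall[shortfall > 0]  # exact deficit in each of those hours
Either way, a vectorised comparison over ten years of hourly data (roughly 87,600 rows) is essentially instantaneous.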

Related

R method to shift standard to daylight savings time by shifting entire dataset up an hour/down an hour for half the year?

There are lots of questions about daylight savings conversion and POSIXct/POSIXlt, date-times, etc., but none that I have found appear to address the approach I would take to daylight savings.
I am interested in analyzing daily load curves for energy use, and approaches which just cut the spring hour out of the dataset do not work for me. I need an approach that shifts all measurements to the subsequent hour after spring daylight savings and to the prior hour after the fall adjustment. See below for a clear example.
library(data.table)
EnergyUse <- data.table("Date" = c("1997-04-06", "1997-04-06", "1997-04-06", "1997-04-06"),
                        "Hour" = 1:4, "Power" = c(30, 40, 60, 80))
print(EnergyUse)
# Date Hour Power
#1: 1997-04-06 1 30
#2: 1997-04-06 2 40 #when daylight savings kicked in for 1997
#3: 1997-04-06 3 60
#4: 1997-04-06 4 80
The "Hour" field ranges from 0 to 23 for every day of the year, i.e. "local standard time". It happens to be Pacific Time, as you will see below, but I would have the same question for any time zone that implemented daylight savings.
Now I need to merge the date and time fields into a single date_time field, formatted as a date-time and incorporating daylight savings, since I am interested in the hourly power patterns (i.e. load curves), which shift both with relative time (e.g. when people go to/get off work) and with absolute time (e.g. when it gets cold/hot or when the sun sets).
EnergyUseAdj <- EnergyUse[, Date_Time := as.POSIXct(paste(Date, Hour), format="%Y-%m-%d %H", tz="America/Los_Angeles")]
which results in:
print(EnergyUseAdj)
# Date Hour Power Date_Time
#1: 1997-04-06 1 30 1997-04-06 01:00:00
#2: 1997-04-06 2 40 <NA>
#3: 1997-04-06 3 60 1997-04-06 03:00:00
#4: 1997-04-06 4 80 1997-04-06 04:00:00
This, however, makes the "Power" data for the new daylight savings 3am and 4am incorrect. The actual power production figure for the daylight adjusted 3am would instead be that which was listed for 2am standard time (i.e. 40), and that for 4am would then be 60.
The correct way to adjust for this, albeit likely more computationally expensive for large datasets, would be to adjust the entire time-series by a positive offset of 1 hour in spring and a negative offset of 1 hour in fall, like the below:
# Date Hour Power Date_Time
#1: 1997-04-06 1 30 1997-04-06 01:00:00
#2: 1997-04-06 2 <NA> <NA>
#3: 1997-04-06 3 40 1997-04-06 03:00:00
#4: 1997-04-06 4 60 1997-04-06 04:00:00
Or, perhaps smoother for use in other algorithms because it has no NA rows, like the below:
# Date Hour Power Date_Time
#1: 1997-04-06 1 30 1997-04-06 01:00:00
#2: 1997-04-06 3 40 1997-04-06 03:00:00
#3: 1997-04-06 4 60 1997-04-06 04:00:00
#4: 1997-04-06 5 80 1997-04-06 05:00:00
After toying around with POSIXct and reading through a bunch of similar questions on this adjustment, I could not find a great solution. Any ideas?
EDIT: Per GregorThomas' request, see below for a larger sample of data in case you wish to use two days' worth.
# OP_DATE OP_HOUR Power
# 1: 1997-04-05 0 71
# 2: 1997-04-05 1 61
# 3: 1997-04-05 2 54
# 4: 1997-04-05 3 57
# 5: 1997-04-05 4 68
# 6: 1997-04-05 5 76
# 7: 1997-04-05 6 89
# 8: 1997-04-05 7 106
# 9: 1997-04-05 8 148
#10: 1997-04-05 9 154
#11: 1997-04-05 10 143
#12: 1997-04-05 11 123
#13: 1997-04-05 12 105
#14: 1997-04-05 13 94
#15: 1997-04-05 14 85
#16: 1997-04-05 15 86
#17: 1997-04-05 16 84
#18: 1997-04-05 17 83
#19: 1997-04-05 18 99
#20: 1997-04-05 19 105
#21: 1997-04-05 20 94
#22: 1997-04-05 21 95
#23: 1997-04-05 22 81
#24: 1997-04-05 23 66
#25: 1997-04-06 0 75
#26: 1997-04-06 1 70
#27: 1997-04-06 2 62
#28: 1997-04-06 3 56
#29: 1997-04-06 4 55
#30: 1997-04-06 5 57
#31: 1997-04-06 6 51
#32: 1997-04-06 7 57
#33: 1997-04-06 8 59
#34: 1997-04-06 9 61
#35: 1997-04-06 10 64
#36: 1997-04-06 11 63
#37: 1997-04-06 12 63
#38: 1997-04-06 13 63
#39: 1997-04-06 14 60
#40: 1997-04-06 15 68
#41: 1997-04-06 16 69
#42: 1997-04-06 17 69
#43: 1997-04-06 18 91
#44: 1997-04-06 19 120
#45: 1997-04-06 20 100
#46: 1997-04-06 21 74
#47: 1997-04-06 22 56
#48: 1997-04-06 23 55
If your data is reliably hourly, you can calculate a sequence of datetimes of the appropriate length. R's implementation of POSIX datetimes accounts for daylight savings time, leap years, etc.
Simplifying the method in my comment, I'd recommend calculating the sequence from the start time and the length.
EnergyUseAdj <- EnergyUse[,
  Date_Time := seq(
    from = as.POSIXct(paste(Date[1], Hour[1]), format = "%Y-%m-%d %H", tz = "America/Los_Angeles"),
    length.out = .N,
    by = "1 hour"
  )]
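Applied to the larger two-day sample (assuming it has been read into a data.table named EnergyUse with the OP_DATE, OP_HOUR and Power columns shown above), the same idea is a sketch like this:
library(data.table)
EnergyUseAdj <- EnergyUse[,
  Date_Time := seq(
    from = as.POSIXct(paste(OP_DATE[1], OP_HOUR[1]), format = "%Y-%m-%d %H", tz = "America/Los_Angeles"),
    length.out = .N,
    by = "1 hour"
  )]
EnergyUseAdj[25:30]  # rows around the 1997-04-06 spring-forward transition
Because seq() steps through absolute time, the wall-clock labels simply skip the nonexistent 02:00 on 1997-04-06, and every later reading is labelled one hour ahead, which is exactly the spring shift the question asks for.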

counting the number of people in the system in R

I have the arrival time, departure time, and date of different customers to a system. I want to count the number of people in the system every 30 minutes. How can I do this in R?
Here are my data
If I understand your question, here's an example with fake data:
library(tidyverse)
library(lubridate)
# Fake data
set.seed(2)
dat = data.frame(id = 1:1000, type = rep(c("A", "B"), 500),
                 arrival = as.POSIXct("2013-08-21 05:00:00") + sample(-10000:10000, 1000, replace = TRUE))
dat$departure = dat$arrival + sample(100:5000, 1000, replace = TRUE)
# Times when we want to check how many people are still present
times = seq(round_date(min(dat$arrival), "hour"), ceiling_date(max(dat$departure), "hour"), "30 min")
# Count number of people present at each time
map_df(times, function(x) {
  dat %>%
    group_by(type) %>%
    summarise(Time = x,
              Count = sum(arrival < x & departure > x)) %>%
    spread(type, Count) %>%
    mutate(Total = A + B)
})
Time A B Total
<dttm> <int> <int> <int>
1 2013-08-21 02:00:00 0 0 0
2 2013-08-21 02:30:00 26 31 57
3 2013-08-21 03:00:00 54 53 107
4 2013-08-21 03:30:00 75 81 156
5 2013-08-21 04:00:00 58 63 121
6 2013-08-21 04:30:00 66 58 124
7 2013-08-21 05:00:00 55 60 115
8 2013-08-21 05:30:00 52 63 115
9 2013-08-21 06:00:00 57 62 119
10 2013-08-21 06:30:00 62 51 113
11 2013-08-21 07:00:00 60 67 127
12 2013-08-21 07:30:00 72 54 126
13 2013-08-21 08:00:00 66 46 112
14 2013-08-21 08:30:00 19 12 31
15 2013-08-21 09:00:00 1 2 3
16 2013-08-21 09:30:00 0 0 0
17 2013-08-21 10:00:00 0 0 0
I'm not sure what you mean by counting the number of people "in the system", but I'm assuming you mean "the number of people who have arrived but not yet departed". To do this, you can apply a simple logical condition on the relevant columns of your dataframe, e.g.
logicVec <- df$arrival_time <= dateTimeObj & dateTimeObj < df$departure_time
logicVec will evidently be a logical vector of TRUEs and FALSEs. Because TRUE == 1 and FALSE == 0, you can then simply use sum(logicVec) to get the total number of people/customers/rows who fulfill the condition written above.
You can then simply repeat this line of code for every dateTimeObj (of class POSIXct, for example) you want. In your case, that would be a sequence of dateTimeObj values 30 minutes apart, as in the sketch below.
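A minimal sketch of that repetition, assuming a data frame df with POSIXct columns arrival_time and departure_time as in the condition above:
## 30-minute grid spanning the observed period
times <- seq(min(df$arrival_time), max(df$departure_time), by = "30 min")
## number of people present at each grid point
counts <- sapply(times, function(t) sum(df$arrival_time <= t & t < df$departure_time))
data.frame(Time = times, Count = counts)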
I hope this helps.

How to calculate average time interval based on unique value?

I'm having trouble calculating the average time interval (how many days) between appearances of the same value in another column.
My data looks like this:
dt subject_id
2016-09-13 77
2016-11-07 1791
2016-09-18 1332
2016-08-31 84
2016-08-23 89
2016-08-23 41
2016-09-15 41
2016-10-12 93
2016-10-05 93
2016-11-09 94
2016-10-25 94
2016-11-03 94
2016-10-09 375
2016-10-14 11
2016-09-27 11
2016-09-13 11
2016-08-23 11
2016-08-27 11
And I want to get something like this:
subject_id mean_day
41 23
93 7
94 7.5
11 13
I tried to use:
aggregate(dt~subject_id, data, mean)
But it can't calculate mean from Date values. Any ideas?
My first approach would be something like this:
df$dt <- as.Date(df$dt)
library(dplyr)
df %>%
  group_by(subject_id) %>%
  summarise((max(dt) - min(dt)) / (n() - 1))
# <int> <time>
#1 11 13.0 days
#2 41 23.0 days
#3 77 NaN days
#4 84 NaN days
#5 89 NaN days
#6 93 7.0 days
#7 94 7.5 days
#8 375 NaN days
#9 1332 NaN days
#10 1791 NaN days
I think this is a starting point for you; you can modify it as you want.
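For instance, if you only want subjects that appear more than once (as in your desired output), with the interval as a plain number in a mean_day column, a possible refinement of the same pipeline is:
df %>%
  group_by(subject_id) %>%
  filter(n() > 1) %>%
  summarise(mean_day = as.numeric(max(dt) - min(dt)) / (n() - 1))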

R: Fill in all elements of sequence of datetime with patchy periodic datetime information

I guess I don't even know really what to 'title' this question as.
But I think this is quite a common data manipulation requirement.
I have data that has a periodic exchange between two parties of a quantity of a good. The exchanges are made hourly. Here is an example data frame:
df <- cbind.data.frame(Seller = as.character(c("A", "A", "A", "A", "A", "A")),
                       Buyer = c("B", "B", "B", "C", "C", "C"),
                       DateTimeFrom = c("1/07/2013 0:00", "1/07/2013 9:00", "1/07/2013 0:00", "1/07/2013 6:00", "1/07/2013 8:00", "2/07/2013 9:00"),
                       DateTimeTo = c("1/07/2013 8:00", "1/07/2013 15:00", "2/07/2013 8:00", "1/07/2013 9:00", "1/07/2013 12:00", "2/07/2013 16:00"),
                       Qty = c(50, 10, 20, 25, 5, 5))
df$DateTimeFrom <- as.POSIXct(df$DateTimeFrom, format = '%d/%m/%Y %H:%M', tz = 'GMT')
df$DateTimeTo <- as.POSIXct(df$DateTimeTo, format = '%d/%m/%Y %H:%M', tz = 'GMT')
> df
Seller Buyer DateTimeFrom DateTimeTo Qty
1 A B 2013-07-01 00:00:00 2013-07-01 08:00:00 50
2 A B 2013-07-01 09:00:00 2013-07-01 15:00:00 10
3 A B 2013-07-01 00:00:00 2013-07-02 08:00:00 20
4 A C 2013-07-01 06:00:00 2013-07-01 09:00:00 25
5 A C 2013-07-01 08:00:00 2013-07-01 12:00:00 5
6 A C 2013-07-02 09:00:00 2013-07-02 16:00:00 5
So, for example, the first row of this data frame says that the Seller "A" sells 50 units of the good to the buyer "B" every hour from midnight on 1/7/13 until 8am on 1/7/13. You can also notice that some of these exchanges between the same two parties can overlap, but just with a different negotiated quantity.
What I need to do (and need your help with) is to generate a sequence covering all hours over this two-day period that sums the total quantity exchanged in that hour between the two parties across all negotiations.
Here would be the resulting dataframe.
DateTimeSeq <- data.frame(seq(ISOdate(2013,7,1,0),by = "hour", length.out = 48))
colnames(DateTimeSeq) <- c("DateTime")
#What the Answer should be
DateTimeSeq$QtyAB <- c(70,70,70,70,70,70,70,70,70,30,30,30,30,30,30,30,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
DateTimeSeq$QtyAC <- c(0,0,0,0,0,0,25,25,30,30,5,5,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,5,5,5,5,5,5,5,0,0,0,0,0,0,0)
> DateTimeSeq
DateTime QtyAB QtyAC
1 2013-07-01 00:00:00 70 0
2 2013-07-01 01:00:00 70 0
3 2013-07-01 02:00:00 70 0
4 2013-07-01 03:00:00 70 0
5 2013-07-01 04:00:00 70 0
6 2013-07-01 05:00:00 70 0
7 2013-07-01 06:00:00 70 25
8 2013-07-01 07:00:00 70 25
9 2013-07-01 08:00:00 70 30
10 2013-07-01 09:00:00 30 30
11 2013-07-01 10:00:00 30 5
12 2013-07-01 11:00:00 30 5
13 2013-07-01 12:00:00 30 5
14 2013-07-01 13:00:00 30 0
15 2013-07-01 14:00:00 30 0
.... etc
Anybody able to lend a hand?
Thanks,
A
Here is my solution, which uses the dplyr and reshape packages.
library(dplyr)
library(reshape)
Firstly, we should expand the dataframe so that everything is in an hourly format. This can be done using the do part of dplyr.
df %>% rowwise() %>%
  do(data.frame(Seller = .$Seller,
                Buyer = .$Buyer,
                Qty = .$Qty,
                DateTimeCurr = seq(from = .$DateTimeFrom, to = .$DateTimeTo, by = "hour")))
Output:
Source: local data frame [66 x 4]
Groups: <by row>
Seller Buyer Qty DateTimeCurr
1 A B 50 2013-07-01 00:00:00
2 A B 50 2013-07-01 01:00:00
3 A B 50 2013-07-01 02:00:00
...
From there it is trivial to get the correct ids and summarise the total using the group_by function.
df1 <- df %>% rowwise() %>%
  do(data.frame(Seller = .$Seller,
                Buyer = .$Buyer,
                Qty = .$Qty,
                DateTimeCurr = seq(from = .$DateTimeFrom, to = .$DateTimeTo, by = "hour"))) %>%
  group_by(Seller, Buyer, DateTimeCurr) %>%
  summarise(TotalQty = sum(Qty)) %>%
  mutate(id = paste0("Qty", Seller, Buyer))
Output:
Source: local data frame [48 x 5]
Groups: Seller, Buyer
Seller Buyer DateTimeCurr TotalQty id
1 A B 2013-07-01 00:00:00 70 QtyAB
2 A B 2013-07-01 01:00:00 70 QtyAB
3 A B 2013-07-01 02:00:00 70 QtyAB
From this dataframe, all we have to do is cast it into the format you have above.
> cast(df1, DateTimeCurr~ id, value="TotalQty")
DateTimeCurr QtyAB QtyAC
1 2013-07-01 00:00:00 70 NA
2 2013-07-01 01:00:00 70 NA
3 2013-07-01 02:00:00 70 NA
4 2013-07-01 03:00:00 70 NA
5 2013-07-01 04:00:00 70 NA
6 2013-07-01 05:00:00 70 NA
So the whole piece of code is:
df1 <- df %>% rowwise() %>%
  do(data.frame(Seller = .$Seller,
                Buyer = .$Buyer,
                Qty = .$Qty,
                DateTimeCurr = seq(from = .$DateTimeFrom, to = .$DateTimeTo, by = "hour"))) %>%
  group_by(Seller, Buyer, DateTimeCurr) %>%
  summarise(TotalQty = sum(Qty)) %>%
  mutate(id = paste0("Qty", Seller, Buyer))
cast(df1, DateTimeCurr ~ id, value = "TotalQty")
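One caveat with the cast() result: pair/hour combinations with no trade come out as NA, and hours with no trades at all are missing entirely. If you need the zero-filled 48-hour layout from the question, a possible final step is to merge against the full hour sequence (reusing the DateTimeSeq frame defined in the question) and replace the NAs:
out <- merge(DateTimeSeq[, "DateTime", drop = FALSE],
             cast(df1, DateTimeCurr ~ id, value = "TotalQty"),
             by.x = "DateTime", by.y = "DateTimeCurr", all.x = TRUE)
out[is.na(out)] <- 0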

Extract and compare column data by date in R

I am using a Kaggle data set for bike sharing. I would like to write a script that compares my predicted values to the training data set. I would like comparisons of the mean by month for each year.
The training data set, which I call df, looks like this:
datetime count
1 2011-01-01 00:00:00 16
2 2011-01-11 01:00:00 40
3 2011-02-01 02:00:00 32
4 2011-02-11 03:00:00 13
5 2011-03-21 04:00:00 1
6 2011-03-11 05:00:00 1
My predicted values, which I call sub, look like this:
datetime count
1 2011-01-01 00:00:00 42
2 2011-01-11 01:00:00 33
3 2011-02-01 02:00:00 33
4 2011-02-11 05:00:00 36
5 2011-03-21 06:00:00 57
6 2011-03-11 07:00:00 129
I have isolated the month and year using the lubridate package, then concatenated them as a new month-year column. I used the new column to split the data, then used lapply to find the mean.
library(lubridate)
df$monyear <- interaction(month(ymd_hms(df$datetime)),
                          year(ymd_hms(df$datetime)),
                          sep = "-")
s <- split(df, df$monyear)
x <- lapply(s, function(x) colMeans(x[, c("count", "count")], na.rm = TRUE))
But this gives me the average for each month-year combination nested in a list, so it is not easy to compare. What I would like instead is:
year-month train-mean sub-mean diff
1 2011-01 28 37.5 9.5
2 2011-02 22.5 34.5 12
3 2011-03 1 93 92
Is there a better way to do this?
Something like this. For each of your data sets:
library(dplyr)
dftrain %>% group_by(monyear) %>% summarize(mc=mean(count)) -> xtrain
dftest %>% group_by(monyear) %>% summarize(mc=mean(count)) -> xtest
merged <- merge(xtrain, xtest, by="monyear")
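merge() disambiguates the duplicated column name with suffixes, so the training mean comes through as mc.x and the predicted mean as mc.y (this assumes dftrain and dftest both already carry the monyear column built with lubridate above). The diff column from the desired output is then one more line:
merged$diff <- merged$mc.y - merged$mc.x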
