I successfully split the time "2018-12-31 11:45:00 AM" into "2018-12-31" and "11:45:00 AM".
However, I am having difficulty converting "11:45:00 AM" to 24-hour format.
I know there are several ways to do that; the most popular is to use strptime with format="%I:%M:%S %p". I tried that several times and double-checked again and again, but I still get NA in my column. Here, crimeData is my dataset name and SplitHrs contains the time, e.g. "11:45:00 AM", as mentioned above:
crimeData$toSplitHrs = strptime(crimeData$SplitHrs, format="%I:%M:%S %p")
Police.Beats SplitMs SplitHrs year month days hours mins sec toSplitHrs
1 28 2018-12-31 11:45:00 2018 12 31 11 45 00 <NA>
2 177 2018-12-31 11:42:00 2018 12 31 11 42 00 <NA>
3 233 2018-12-31 11:30:00 2018 12 31 11 30 00 <NA>
4 91 2018-12-31 11:30:00 2018 12 31 11 30 00 <NA>
5 73 2018-12-31 11:30:00 2018 12 31 11 30 00 <NA>
6 232 2018-12-31 11:27:00 2018 12 31 11 27 00 <NA>
but I still got NA from that...
Also, this dataset contains over 10k observations, so I really cannot change them one by one. Any suggestions are appreciated!
You can try the format %r for the time, taking into account the am/pm specification (see ?strptime):
strptime("2018-12-31 11:45:00 am", format="%F %r")
#[1] "2018-12-31 11:45:00 CET"
strptime("2018-12-31 11:45:00 pm", format="%F %r")
#[1] "2018-12-31 23:45:00 CET"
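For the time-only column, a minimal sketch of the same idea is below. It assumes the strings really carry an AM/PM marker; the vector names are made up for illustration. Note that the SplitHrs values printed in the question show no AM/PM at all, and a string without the marker cannot match %p, which would explain the NA.

```r
# %p matching is locale-dependent; the "C" locale recognises AM/PM.
invisible(Sys.setlocale("LC_TIME", "C"))

# Hypothetical sample values in 12-hour notation.
times_12h <- c("11:45:00 AM", "11:45:00 PM", "12:15:00 AM")

# Parse the 12-hour clock, then print it back with %H (hours 00-23).
parsed <- strptime(times_12h, format = "%I:%M:%S %p", tz = "UTC")
times_24h <- format(parsed, "%H:%M:%S")
print(times_24h)
# [1] "11:45:00" "23:45:00" "00:15:00"

# A string with no AM/PM marker does not match "%p", so the result is NA.
no_marker <- strptime("11:45:00", format = "%I:%M:%S %p", tz = "UTC")
print(is.na(no_marker))
```

If the marker was lost during the earlier split, it is probably easier to convert the original full string first and split afterwards.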
I have a data frame df1 with a datetime column in UTC. I need to merge it with the data frame df2 by the datetime column. My problem is that df2 is in Europe/Paris time, and when I convert df2$datetime from Europe/Paris to UTC, I lose or duplicate data at the moments when the clocks change (summer to winter time or winter to summer time). As an example:
df1<- data.frame(datetime=c("2016-10-29 22:00:00","2016-10-29 23:00:00","2016-10-30 00:00:00","2016-10-30 01:00:00","2016-10-30 02:00:00","2016-10-30 03:00:00","2016-10-30 04:00:00","2016-10-30 05:00:00","2017-03-25 22:00:00","2017-03-25 23:00:00","2017-03-26 00:00:00","2017-03-26 01:00:00","2017-03-26 02:00:00","2017-03-26 03:00:00","2017-03-26 04:00:00"), Var1= c(4, 56, 76, 54, 34, 3, 4, 6, 78, 23, 12, 3, 5, 6, 7))
df1$datetime<- as.POSIXct(df1$datetime, format = "%Y-%m-%d %H", tz= "UTC")
df2<- data.frame(datetime=c("2016-10-29 22:00:00","2016-10-29 23:00:00","2016-10-30 00:00:00","2016-10-30 01:00:00","2016-10-30 02:00:00","2016-10-30 03:00:00","2016-10-30 04:00:00","2016-10-30 05:00:00","2017-03-25 22:00:00","2017-03-25 23:00:00","2017-03-26 00:00:00","2017-03-26 01:00:00","2017-03-26 02:00:00","2017-03-26 03:00:00","2017-03-26 04:00:00"), Var2=c(56, 43, 23, 14, 51, 27, 89, 76, 56, 4, 35, 23, 4, 62, 84))
df2$datetime<- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz= "Europe/Paris")
df1
datetime Var1
1 2016-10-29 22:00:00 4
2 2016-10-29 23:00:00 56
3 2016-10-30 00:00:00 76
4 2016-10-30 01:00:00 54
5 2016-10-30 02:00:00 34
6 2016-10-30 03:00:00 3
7 2016-10-30 04:00:00 4
8 2016-10-30 05:00:00 6
9 2017-03-25 22:00:00 78
10 2017-03-25 23:00:00 23
11 2017-03-26 00:00:00 12
12 2017-03-26 01:00:00 3
13 2017-03-26 02:00:00 5
14 2017-03-26 03:00:00 6
15 2017-03-26 04:00:00 7
df2
datetime Var2
1 2016-10-29 22:00:00 56
2 2016-10-29 23:00:00 43
3 2016-10-30 00:00:00 23
4 2016-10-30 01:00:00 14
5 2016-10-30 02:00:00 51
6 2016-10-30 03:00:00 27
7 2016-10-30 04:00:00 89
8 2016-10-30 05:00:00 76
9 2017-03-25 22:00:00 56
10 2017-03-25 23:00:00 4
11 2017-03-26 00:00:00 35
12 2017-03-26 01:00:00 23
13 2017-03-26 02:00:00 4
14 2017-03-26 03:00:00 62
15 2017-03-26 04:00:00 84
When I change df2$datetime format from Europe/Paris to UTC, this happens:
library(lubridate)
df2$datetime<-with_tz(df2$datetime,"UTC")
df2
datetime Var2
1 2016-10-29 20:00:00 56
2 2016-10-29 21:00:00 43
3 2016-10-29 22:00:00 23
4 2016-10-29 23:00:00 14
5 2016-10-30 00:00:00 51
6 2016-10-30 02:00:00 27 # Data at 01:00:00 is missing
7 2016-10-30 03:00:00 89
8 2016-10-30 04:00:00 76
9 2017-03-25 21:00:00 56
10 2017-03-25 22:00:00 4
11 2017-03-25 23:00:00 35
12 2017-03-26 00:00:00 23
13 2017-03-26 00:00:00 4 # There is a duplicate at 00:00:00
14 2017-03-26 01:00:00 62
15 2017-03-26 02:00:00 84
16 2017-03-26 03:00:00 56
Is there another way to transform df2$datetime from Europe/Paris format to UTC format that allows me to merge two data frames without this problem of having either lost or duplicated data? I don't understand why I have to lose or duplicate info in df2.
Is the transformation I applied to df2$datetime the right one for merging this data frame with df1? What I've done so far to work around this is to add a new row in df2 on 2016-10-30 at 01:00:00 that is the mean of the values at 2016-10-30 00:00:00 and 2016-10-30 02:00:00, and to remove one row on 2017-03-26 at 00:00:00.
Thanks for your help.
I found out that my original df2 should be like this:
df2
datetime Var1
1 2016-10-29 22:00:00 4 # This is time in format "GMT+2". It corresponds to 20:00 UTC
2 2016-10-29 23:00:00 56 # This is time in format "GMT+2". It corresponds to 21:00 UTC
3 2016-10-30 00:00:00 76 # This is time in format "GMT+2". It corresponds to 22:00 UTC
4 2016-10-30 01:00:00 54 # This is time in format "GMT+2". It corresponds to 23:00 UTC
5 2016-10-30 02:00:00 34 # This is time in format "GMT+2". It corresponds to 00:00 UTC
6 2016-10-30 02:00:00 3 # This is time in format "GMT+1". It corresponds to 01:00 UTC
7 2016-10-30 03:00:00 4 # This is time in format "GMT+1". It corresponds to 02:00 UTC
8 2016-10-30 04:00:00 6 # This is time in format "GMT+1". It corresponds to 03:00 UTC
9 2016-10-30 05:00:00 78 # This is time in format "GMT+1". It corresponds to 04:00 UTC
10 2017-03-25 22:00:00 23 # This is time in format "GMT+1". It corresponds to 21:00 UTC
11 2017-03-25 23:00:00 12 # This is time in format "GMT+1". It corresponds to 22:00 UTC
12 2017-03-26 00:00:00 3 # This is time in format "GMT+1". It corresponds to 23:00 UTC
13 2017-03-26 01:00:00 5 # This is time in format "GMT+1". It corresponds to 00:00 UTC
14 2017-03-26 03:00:00 6 # This is time in format "GMT+2". It corresponds to 01:00 UTC
15 2017-03-26 04:00:00 7 # This is time in format "GMT+2". It corresponds to 02:00 UTC
16 2017-03-26 05:00:00 76 # This is time in format "GMT+2". It corresponds to 03:00 UTC
However, my original df2 doesn't have duplicated or lost time data. It is like this:
df2
datetime Var1
1 2016-10-29 22:00:00 4
2 2016-10-29 23:00:00 56
3 2016-10-30 00:00:00 76
4 2016-10-30 01:00:00 54
5 2016-10-30 02:00:00 34
6 2016-10-30 03:00:00 3
7 2016-10-30 04:00:00 4
8 2016-10-30 05:00:00 6
9 2017-03-25 22:00:00 78
10 2017-03-25 23:00:00 23
11 2017-03-26 00:00:00 12
12 2017-03-26 01:00:00 3
13 2017-03-26 02:00:00 5
14 2017-03-26 03:00:00 6
15 2017-03-26 04:00:00 7
16 2017-03-26 05:00:00 76
When I applied the R code df2$datetime<-with_tz(df2$datetime,"UTC"), this happens:
df2
datetime Var1
1 2016-10-29 20:00:00 4
2 2016-10-29 21:00:00 56
3 2016-10-29 22:00:00 76
4 2016-10-29 23:00:00 54
5 2016-10-30 00:00:00 34
6 2016-10-30 02:00:00 3 # I have to manually add a new row between the times "00:00" and "02:00"
7 2016-10-30 03:00:00 4
8 2016-10-30 04:00:00 6
9 2017-03-25 21:00:00 78
10 2017-03-25 22:00:00 23
11 2017-03-25 23:00:00 12
12 2017-03-26 00:00:00 3
13 2017-03-26 01:00:00 5 # I have to manually remove one of the rows referring to the time "01:00".
14 2017-03-26 01:00:00 6
15 2017-03-26 02:00:00 7
16 2017-03-26 03:00:00 76
If my original df2 had one duplicate at "02:00:00" on 30th October and a gap on 26th March between "01:00" and "03:00", the R code df2$datetime<-with_tz(df2$datetime,"UTC") would give me this:
df2
datetime Var1
1 2016-10-29 20:00:00 4
2 2016-10-29 21:00:00 56
3 2016-10-29 22:00:00 76
4 2016-10-29 23:00:00 54
5 2016-10-30 00:00:00 34
6 2016-10-30 00:00:00 3 # I just have to change "00:00:00" for "01:00"
7 2016-10-30 02:00:00 4
8 2016-10-30 03:00:00 6
9 2016-10-30 04:00:00 78
10 2017-03-25 21:00:00 23
11 2017-03-25 22:00:00 12
12 2017-03-25 23:00:00 3
13 2017-03-26 00:00:00 5
14 2017-03-26 01:00:00 6
15 2017-03-26 02:00:00 7
16 2017-03-26 03:00:00 76
#As there are several versions of df2, I use the one shown in the question
df2 <- read.table(text = "
datetime Var2
1 '2016-10-29 22:00:00' 56
2 '2016-10-29 23:00:00' 43
3 '2016-10-30 00:00:00' 23
4 '2016-10-30 01:00:00' 14
5 '2016-10-30 02:00:00' 51
6 '2016-10-30 03:00:00' 27
7 '2016-10-30 04:00:00' 89
8 '2016-10-30 05:00:00' 76
9 '2017-03-25 22:00:00' 56
10 '2017-03-25 23:00:00' 4
11 '2017-03-26 00:00:00' 35
12 '2017-03-26 01:00:00' 23
13 '2017-03-26 02:00:00' 4
14 '2017-03-26 03:00:00' 62
15 '2017-03-26 04:00:00' 84
", header = TRUE)
library(lubridate)
#As soon as you assign the time zone, the content of df2 is already changed
df2$datetimeEP <- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz= "Europe/Paris")
#df2[13,]
# datetime Var2 datetimeEP
#13 2017-03-26 02:00:00 4 2017-03-26 01:00:00
#To me it looks like your recorded times don't consider "daylight saving time".
#So you have to use e.g. "Etc/GMT-1" (a fixed UTC+1 offset) instead of "Europe/Paris"
df2$datetimeG1 <- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz= "Etc/GMT-1")
data.frame(datetime=df2$datetime, utc=with_tz(df2$datetimeG1,"UTC"))
# datetime utc
#1 2016-10-29 22:00:00 2016-10-29 21:00:00
#2 2016-10-29 23:00:00 2016-10-29 22:00:00
#3 2016-10-30 00:00:00 2016-10-29 23:00:00
#4 2016-10-30 01:00:00 2016-10-30 00:00:00
#5 2016-10-30 02:00:00 2016-10-30 01:00:00
#6 2016-10-30 03:00:00 2016-10-30 02:00:00
#7 2016-10-30 04:00:00 2016-10-30 03:00:00
#8 2016-10-30 05:00:00 2016-10-30 04:00:00
#9 2017-03-25 22:00:00 2017-03-25 21:00:00
#10 2017-03-25 23:00:00 2017-03-25 22:00:00
#11 2017-03-26 00:00:00 2017-03-25 23:00:00
#12 2017-03-26 01:00:00 2017-03-26 00:00:00
#13 2017-03-26 02:00:00 2017-03-26 01:00:00
#14 2017-03-26 03:00:00 2017-03-26 02:00:00
#15 2017-03-26 04:00:00 2017-03-26 03:00:00
#You can use "dst" to check whether each datetime in a time zone falls in "daylight saving time"
dst(df2$datetimeEP)
dst(df2$datetimeG1)
dst(with_tz(df2$datetimeEP,"UTC"))
dst(with_tz(df2$datetimeG1,"UTC"))
#If your recorded times do consider "daylight saving time", then you HAVE a gap and an overlap.
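To connect this back to the merge, here is a minimal sketch. It assumes the logger recorded fixed standard time (UTC+1, no daylight saving), and it uses small made-up versions of df1 and df2 covering only the three hours around the spring clock change.

```r
# Hypothetical mini df1, already in UTC.
df1 <- data.frame(
  datetime = as.POSIXct(c("2017-03-26 00:00:00", "2017-03-26 01:00:00",
                          "2017-03-26 02:00:00"),
                        format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
  Var1 = c(3, 5, 6)
)

# Hypothetical mini df2, recorded as local clock time without DST.
df2 <- data.frame(
  datetime = c("2017-03-26 01:00:00", "2017-03-26 02:00:00",
               "2017-03-26 03:00:00"),
  Var2 = c(23, 4, 62),
  stringsAsFactors = FALSE
)

# "Etc/GMT-1" is a fixed UTC+1 offset (the sign is inverted in these names),
# so 02:00 on the changeover day is a valid, unique instant.
df2$datetime <- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H:%M:%S",
                           tz = "Etc/GMT-1")

# Show both key columns in UTC; the underlying instants are unchanged.
attr(df2$datetime, "tzone") <- "UTC"

merged <- merge(df1, df2, by = "datetime")
print(merged)
```

With a fixed offset every local hour maps to exactly one UTC hour, so nothing is lost or duplicated. If the times were in fact recorded with daylight saving, this assumption is wrong and the gap and overlap are real.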
My dataset is a bit noisy at 1-min intervals. So, for each hour, I'd like to average the values from minute 25 to minute 35 and let that average stand for the hour at minute 30.
For example, an average at: 00:30 (average from 00:25 to 00:35), 01:30 (average from 01:25 to 01:35), 02:30 (average from 02:25 to 02:35), etc.
Can you suggest a good way to do this in R?
Here is my dataset:
set.seed(1)
DateTime <- seq(as.POSIXct("2010/1/1 00:00"), as.POSIXct("2010/1/5 00:00"), "min")
value <- rnorm(n=length(DateTime), mean=100, sd=1)
df <- data.frame(DateTime, value)
Thanks a lot.
Here's one way
library(dplyr)
df %>%
filter(between(as.numeric(format(DateTime, "%M")), 25, 35)) %>%
group_by(hour=format(DateTime, "%Y-%m-%d %H")) %>%
summarise(value=mean(value))
I think that the existing answers are not general enough as they do not take into account that a time interval could fall within multiple midpoints.
I would instead use shift from the data.table package.
library(data.table)
setDT(df)
First set the window width to match the interval you chose above. This calculates, for every row in your table, the average of 11 rows (the row itself plus the five minutes before and the five minutes after):
df[, ave_val :=
Reduce('+',c(shift(value, 0:5L, type = "lag"),shift(value, 1:5L, type = "lead")))/11
]
Then generate the midpoints you want:
mids <- seq(as.POSIXct("2010/1/1 00:00"), as.POSIXct("2010/1/5 00:00"), by = 60*60) + 30*60 # every hour starting at 0:30
Then filter accordingly:
setkey(df,DateTime)
df[J(mids)]
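The same centered-window idea can be sketched in base R with stats::filter. This is only an illustration on a shortened three-hour version of the sample data, and it assumes the series is complete at exactly one row per minute.

```r
set.seed(1)
DateTime <- seq(as.POSIXct("2010-01-01 00:00", tz = "UTC"),
                as.POSIXct("2010-01-01 03:00", tz = "UTC"), by = "min")
value <- rnorm(length(DateTime), mean = 100, sd = 1)

# Centered 11-point moving average: 5 minutes before, the minute itself,
# and 5 minutes after. The first and last 5 rows come out as NA.
ma <- as.numeric(stats::filter(value, rep(1 / 11, 11), sides = 2))

# Keep only the rows that sit at minute 30 of each hour.
at_half <- format(DateTime, "%M") == "30"
centered <- data.frame(DateTime = DateTime[at_half], value = ma[at_half])
print(centered)

# Cross-check: 00:25 to 00:35 are rows 26 to 36 of the series.
direct <- mean(value[26:36])
```

Because the window is defined by row positions, a missing minute would silently shift it; the filter-then-aggregate answers do not have that problem.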
Since you want to average on just a subset of each period, I think it makes sense to first subset the data.frame, then aggregate:
aggregate(
value~cbind(time=strftime(DateTime,'%Y-%m-%d %H:30:00')),
subset(df,{ m <- strftime(DateTime,'%M'); m>='25' & m<='35'; }),
mean
);
## time value
## 1 2010-01-01 00:30:00 99.82317
## 2 2010-01-01 01:30:00 100.58184
## 3 2010-01-01 02:30:00 99.54985
## 4 2010-01-01 03:30:00 100.47238
## 5 2010-01-01 04:30:00 100.05517
## 6 2010-01-01 05:30:00 99.96252
## 7 2010-01-01 06:30:00 99.79512
## 8 2010-01-01 07:30:00 99.06791
## 9 2010-01-01 08:30:00 99.58731
## 10 2010-01-01 09:30:00 100.27202
## 11 2010-01-01 10:30:00 99.60758
## 12 2010-01-01 11:30:00 99.92074
## 13 2010-01-01 12:30:00 99.65819
## 14 2010-01-01 13:30:00 100.04202
## 15 2010-01-01 14:30:00 100.04461
## 16 2010-01-01 15:30:00 100.11609
## 17 2010-01-01 16:30:00 100.08631
## 18 2010-01-01 17:30:00 100.41956
## 19 2010-01-01 18:30:00 99.98065
## 20 2010-01-01 19:30:00 100.07341
## 21 2010-01-01 20:30:00 100.20281
## 22 2010-01-01 21:30:00 100.86013
## 23 2010-01-01 22:30:00 99.68170
## 24 2010-01-01 23:30:00 99.68097
## 25 2010-01-02 00:30:00 99.58603
## 26 2010-01-02 01:30:00 100.10178
## 27 2010-01-02 02:30:00 99.78766
## 28 2010-01-02 03:30:00 100.02220
## 29 2010-01-02 04:30:00 99.83427
## 30 2010-01-02 05:30:00 99.74934
## 31 2010-01-02 06:30:00 99.99594
## 32 2010-01-02 07:30:00 100.08257
## 33 2010-01-02 08:30:00 99.47077
## 34 2010-01-02 09:30:00 99.81419
## 35 2010-01-02 10:30:00 100.13294
## 36 2010-01-02 11:30:00 99.78352
## 37 2010-01-02 12:30:00 100.04590
## 38 2010-01-02 13:30:00 99.91061
## 39 2010-01-02 14:30:00 100.61730
## 40 2010-01-02 15:30:00 100.18539
## 41 2010-01-02 16:30:00 99.45165
## 42 2010-01-02 17:30:00 100.09894
## 43 2010-01-02 18:30:00 100.04131
## 44 2010-01-02 19:30:00 99.58399
## 45 2010-01-02 20:30:00 99.75524
## 46 2010-01-02 21:30:00 99.94079
## 47 2010-01-02 22:30:00 100.26533
## 48 2010-01-02 23:30:00 100.35354
## 49 2010-01-03 00:30:00 100.31141
## 50 2010-01-03 01:30:00 100.10709
## 51 2010-01-03 02:30:00 99.41102
## 52 2010-01-03 03:30:00 100.07964
## 53 2010-01-03 04:30:00 99.88183
## 54 2010-01-03 05:30:00 99.91112
## 55 2010-01-03 06:30:00 99.71431
## 56 2010-01-03 07:30:00 100.48585
## 57 2010-01-03 08:30:00 100.35096
## 58 2010-01-03 09:30:00 100.00060
## 59 2010-01-03 10:30:00 100.03858
## 60 2010-01-03 11:30:00 99.95713
## 61 2010-01-03 12:30:00 99.18699
## 62 2010-01-03 13:30:00 99.49216
## 63 2010-01-03 14:30:00 99.37762
## 64 2010-01-03 15:30:00 99.68642
## 65 2010-01-03 16:30:00 99.84921
## 66 2010-01-03 17:30:00 99.84039
## 67 2010-01-03 18:30:00 99.90989
## 68 2010-01-03 19:30:00 99.95421
## 69 2010-01-03 20:30:00 100.01276
## 70 2010-01-03 21:30:00 100.14585
## 71 2010-01-03 22:30:00 99.54110
## 72 2010-01-03 23:30:00 100.02526
## 73 2010-01-04 00:30:00 100.04476
## 74 2010-01-04 01:30:00 99.61132
## 75 2010-01-04 02:30:00 99.94782
## 76 2010-01-04 03:30:00 99.44863
## 77 2010-01-04 04:30:00 99.91305
## 78 2010-01-04 05:30:00 100.25428
## 79 2010-01-04 06:30:00 99.86279
## 80 2010-01-04 07:30:00 99.63516
## 81 2010-01-04 08:30:00 99.65747
## 82 2010-01-04 09:30:00 99.57810
## 83 2010-01-04 10:30:00 99.77603
## 84 2010-01-04 11:30:00 99.85140
## 85 2010-01-04 12:30:00 100.82995
## 86 2010-01-04 13:30:00 100.26138
## 87 2010-01-04 14:30:00 100.25851
## 88 2010-01-04 15:30:00 99.92685
## 89 2010-01-04 16:30:00 100.00825
## 90 2010-01-04 17:30:00 100.24437
## 91 2010-01-04 18:30:00 99.62711
## 92 2010-01-04 19:30:00 99.93999
## 93 2010-01-04 20:30:00 99.82477
## 94 2010-01-04 21:30:00 100.15321
## 95 2010-01-04 22:30:00 99.88370
## 96 2010-01-04 23:30:00 100.06657
I guess I don't even know really what to 'title' this question as.
But I think this is quite a common data manipulation requirement.
I have data that has a periodic exchange between two parties of a quantity of a good. The exchanges are made hourly. Here is an example data frame:
df <- cbind.data.frame(Seller = as.character(c("A","A","A","A","A","A")),
Buyer = c("B","B","B","C","C","C"),
DateTimeFrom = c("1/07/2013 0:00","1/07/2013 9:00","1/07/2013 0:00","1/07/2013 6:00","1/07/2013 8:00","2/07/2013 9:00"),
DateTimeTo = c("1/07/2013 8:00","1/07/2013 15:00","2/07/2013 8:00","1/07/2013 9:00","1/07/2013 12:00","2/07/2013 16:00"),
Qty = c(50,10,20,25,5,5)
)
df$DateTimeFrom <- as.POSIXct(df$DateTimeFrom, format = '%d/%m/%Y %H:%M', tz = 'GMT')
df$DateTimeTo <- as.POSIXct(df$DateTimeTo, format = '%d/%m/%Y %H:%M', tz = 'GMT')
> df
Seller Buyer DateTimeFrom DateTimeTo Qty
1 A B 2013-07-01 00:00:00 2013-07-01 08:00:00 50
2 A B 2013-07-01 09:00:00 2013-07-01 15:00:00 10
3 A B 2013-07-01 00:00:00 2013-07-02 08:00:00 20
4 A C 2013-07-01 06:00:00 2013-07-01 09:00:00 25
5 A C 2013-07-01 08:00:00 2013-07-01 12:00:00 5
6 A C 2013-07-02 09:00:00 2013-07-02 16:00:00 5
So, for example, the first row of this data frame says that the Seller "A" sells 50 units of the good to the buyer "B" every hour from midnight on 1/7/13 until 8am on 1/7/13. You can also notice that some of these exchanges between the same two parties can overlap, but just with a different negotiated quantity.
What I need to do (and need your help with) is to generate a sequence covering all hours over this two-day period that sums the total quantity exchanged in each hour between each seller and buyer pair over all negotiations.
Here would be the resulting dataframe.
DateTimeSeq <- data.frame(seq(ISOdate(2013,7,1,0),by = "hour", length.out = 48))
colnames(DateTimeSeq) <- c("DateTime")
#What the Answer should be
DateTimeSeq$QtyAB <- c(70,70,70,70,70,70,70,70,70,30,30,30,30,30,30,30,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,20,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
DateTimeSeq$QtyAC <- c(0,0,0,0,0,0,25,25,30,30,5,5,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,5,5,5,5,5,5,5,0,0,0,0,0,0,0)
> DateTimeSeq
DateTime QtyAB QtyAC
1 2013-07-01 00:00:00 70 0
2 2013-07-01 01:00:00 70 0
3 2013-07-01 02:00:00 70 0
4 2013-07-01 03:00:00 70 0
5 2013-07-01 04:00:00 70 0
6 2013-07-01 05:00:00 70 0
7 2013-07-01 06:00:00 70 25
8 2013-07-01 07:00:00 70 25
9 2013-07-01 08:00:00 70 30
10 2013-07-01 09:00:00 30 30
11 2013-07-01 10:00:00 30 5
12 2013-07-01 11:00:00 30 5
13 2013-07-01 12:00:00 30 5
14 2013-07-01 13:00:00 30 0
15 2013-07-01 14:00:00 30 0
.... etc
Anybody able to lend a hand?
Thanks,
A
Here is my solution which uses the dplyr and reshape package.
library(dplyr)
library(reshape)
Firstly, we should expand the dataframe so that everything is in an hourly format. This can be done using the do part of dplyr.
df %>% rowwise() %>%
do(data.frame(Seller=.$Seller,
Buyer=.$Buyer,
Qty=.$Qty,
DateTimeCurr=seq(from=.$DateTimeFrom, to=.$DateTimeTo, by="hour")))
Output:
Source: local data frame [66 x 4]
Groups: <by row>
Seller Buyer Qty DateTimeCurr
1 A B 50 2013-07-01 00:00:00
2 A B 50 2013-07-01 01:00:00
3 A B 50 2013-07-01 02:00:00
...
From there it is trivial to get the correct id's and summarise the total using the group_by function.
df1 <- df %>% rowwise() %>%
do(data.frame(Seller=.$Seller,
Buyer=.$Buyer,
Qty=.$Qty,
DateTimeCurr=seq(from=.$DateTimeFrom, to=.$DateTimeTo, by="hour"))) %>%
group_by(Seller, Buyer, DateTimeCurr) %>%
summarise(TotalQty=sum(Qty)) %>%
mutate(id=paste0("Qty", Seller, Buyer))
Output:
Source: local data frame [48 x 5]
Groups: Seller, Buyer
Seller Buyer DateTimeCurr TotalQty id
1 A B 2013-07-01 00:00:00 70 QtyAB
2 A B 2013-07-01 01:00:00 70 QtyAB
3 A B 2013-07-01 02:00:00 70 QtyAB
From this dataframe, all we have to do is cast it into the format you have above.
> cast(df1, DateTimeCurr~ id, value="TotalQty")
DateTimeCurr QtyAB QtyAC
1 2013-07-01 00:00:00 70 NA
2 2013-07-01 01:00:00 70 NA
3 2013-07-01 02:00:00 70 NA
4 2013-07-01 03:00:00 70 NA
5 2013-07-01 04:00:00 70 NA
6 2013-07-01 05:00:00 70 NA
So the whole piece of code is:
df1 <- df %>% rowwise() %>%
do(data.frame(Seller=.$Seller,
Buyer=.$Buyer,
Qty=.$Qty,
DateTimeCurr=seq(from=.$DateTimeFrom, to=.$DateTimeTo, by="hour"))) %>%
group_by(Seller, Buyer, DateTimeCurr) %>%
summarise(TotalQty=sum(Qty)) %>%
mutate(id=paste0("Qty", Seller, Buyer))
cast(df1, DateTimeCurr~ id, value="TotalQty")
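Note that the cast output has NA where a pair has no exchange and only lists hours that occur in the data. As a rough alternative, here is a base R sketch that expands each row to hourly steps and tabulates against the full 48-hour grid, so empty hours come out as 0. The object names (expanded, hours, wide) are made up for this example.

```r
df <- data.frame(
  Seller = c("A", "A", "A", "A", "A", "A"),
  Buyer = c("B", "B", "B", "C", "C", "C"),
  DateTimeFrom = c("1/07/2013 0:00", "1/07/2013 9:00", "1/07/2013 0:00",
                   "1/07/2013 6:00", "1/07/2013 8:00", "2/07/2013 9:00"),
  DateTimeTo = c("1/07/2013 8:00", "1/07/2013 15:00", "2/07/2013 8:00",
                 "1/07/2013 9:00", "1/07/2013 12:00", "2/07/2013 16:00"),
  Qty = c(50, 10, 20, 25, 5, 5),
  stringsAsFactors = FALSE
)
df$DateTimeFrom <- as.POSIXct(df$DateTimeFrom, format = "%d/%m/%Y %H:%M", tz = "GMT")
df$DateTimeTo <- as.POSIXct(df$DateTimeTo, format = "%d/%m/%Y %H:%M", tz = "GMT")

# Expand every contract row into one row per hour (both ends inclusive).
expanded <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  hrs <- seq(df$DateTimeFrom[i], df$DateTimeTo[i], by = "hour")
  data.frame(id = paste0("Qty", df$Seller[i], df$Buyer[i]),
             DateTime = format(hrs, "%Y-%m-%d %H:%M", tz = "GMT"),
             Qty = df$Qty[i],
             stringsAsFactors = FALSE)
}))

# Full hourly grid for the two days; using it as factor levels makes
# xtabs() report 0 for hours without any exchange.
hours <- seq(as.POSIXct("2013-07-01 00:00", tz = "GMT"), by = "hour", length.out = 48)
expanded$DateTime <- factor(expanded$DateTime,
                            levels = format(hours, "%Y-%m-%d %H:%M", tz = "GMT"))

# Sum Qty per hour and per seller/buyer pair.
wide <- xtabs(Qty ~ DateTime + id, data = expanded)
print(head(wide, 12))
```

The columns of wide should line up with QtyAB and QtyAC in the expected answer from the question.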
I have some observed data by hour. I am trying to subset this data by the day or even week intervals. I am not sure how to proceed with this task in R.
The sample of the data is below.
date obs
2011-10-24 01:00:00 12
2011-10-24 02:00:00 4
2011-10-24 19:00:00 18
2011-10-24 20:00:00 7
2011-10-24 21:00:00 4
2011-10-24 22:00:00 2
2011-10-25 00:00:00 4
2011-10-25 01:00:00 2
2011-10-25 02:00:00 2
2011-10-25 15:00:00 12
2011-10-25 18:00:00 2
2011-10-25 19:00:00 3
2011-10-25 21:00:00 2
2011-10-25 23:00:00 9
2011-10-26 00:00:00 13
2011-10-26 01:00:00 11
First I entered the data with the multiple spaces replaced with tabs.
dat$date <- as.POSIXct(dat$date, format="%Y-%m-%d %H:%M:%S")
split(dat , as.POSIXlt(dat$date)$yday)
# Notice these are not the same functions
#---------------------
$`296`
date obs
1 2011-10-24 01:00:00 12
2 2011-10-24 02:00:00 4
3 2011-10-24 19:00:00 18
4 2011-10-24 20:00:00 7
5 2011-10-24 21:00:00 4
6 2011-10-24 22:00:00 2
$`297`
date obs
7 2011-10-25 00:00:00 4
8 2011-10-25 01:00:00 2
9 2011-10-25 02:00:00 2
10 2011-10-25 15:00:00 12
11 2011-10-25 18:00:00 2
12 2011-10-25 19:00:00 3
13 2011-10-25 21:00:00 2
14 2011-10-25 23:00:00 9
$`298`
date obs
15 2011-10-26 00:00:00 13
16 2011-10-26 01:00:00 11
The POSIXlt class does not work well inside data frames, but it can be very handy for creating time-based groups. It's a list structure with these components: 'yday', 'wday', 'year', 'mon', 'mday', 'hour', 'min', 'sec' and 'isdst'. The cut.POSIXt function adds divisions at other natural boundaries, e.g.:
?cut.POSIXt
split(dat , cut(dat$date, "week") )
If you wanted to sum within date:
tapply(dat$obs, as.POSIXlt(dat$date)$yday, sum)
#-------
296 297 298
47 36 24
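One caveat with yday is that it is only the day of the year, so data spanning several years would mix days from different years. A small sketch that groups on the full calendar date instead is below; it rebuilds the sample data with an assumed UTC time zone.

```r
dat <- data.frame(
  date = as.POSIXct(c(
    "2011-10-24 01:00:00", "2011-10-24 02:00:00", "2011-10-24 19:00:00",
    "2011-10-24 20:00:00", "2011-10-24 21:00:00", "2011-10-24 22:00:00",
    "2011-10-25 00:00:00", "2011-10-25 01:00:00", "2011-10-25 02:00:00",
    "2011-10-25 15:00:00", "2011-10-25 18:00:00", "2011-10-25 19:00:00",
    "2011-10-25 21:00:00", "2011-10-25 23:00:00", "2011-10-26 00:00:00",
    "2011-10-26 01:00:00"), format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
  obs = c(12, 4, 18, 7, 4, 2, 4, 2, 2, 12, 2, 3, 2, 9, 13, 11)
)

# Group key is the full date, so it stays unique across years.
day <- format(dat$date, "%Y-%m-%d")
daily <- tapply(dat$obs, day, sum)
print(daily)

# Weekly totals; cut() starts weeks on Monday by default.
weekly <- tapply(dat$obs, cut(dat$date, "week"), sum)
print(weekly)
```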
I'd use a time series class such as xts
dat <- read.table(text="2011-10-24 01:00:00 12
2011-10-24 02:00:00 4
2011-10-24 19:00:00 18
2011-10-24 20:00:00 7
2011-10-24 21:00:00 4
2011-10-24 22:00:00 2
2011-10-25 00:00:00 4
2011-10-25 01:00:00 2
2011-10-25 02:00:00 2
2011-10-25 15:00:00 12
2011-10-25 18:00:00 2
2011-10-25 19:00:00 3
2011-10-25 21:00:00 2
2011-10-25 23:00:00 9
2011-10-26 00:00:00 13
2011-10-26 01:00:00 11", header=FALSE, stringsAsFactors=FALSE)
library(xts)
xobj <- xts(dat[, 3], as.POSIXct(paste(dat[, 1], dat[, 2])))
xts subsetting is very intuitive. For all data on "2011-10-25", do this
xobj["2011-10-25"]
# [,1]
#2011-10-25 00:00:00 4
#2011-10-25 01:00:00 2
#2011-10-25 02:00:00 2
#2011-10-25 15:00:00 12
#2011-10-25 18:00:00 2
#2011-10-25 19:00:00 3
#2011-10-25 21:00:00 2
#2011-10-25 23:00:00 9
You can also subset out time spans like this (all data between and including 2011-10-24 and 2011-10-25)
xobj["2011-10-24/2011-10-25"]
Or, if you want all data from October 2011,
xobj["2011-10"]
If you want to get all data from any day that is between 19:00 and 20:00,
xobj['T19:00:00/T20:00:00']
# [,1]
#2011-10-24 19:00:00 18
#2011-10-24 20:00:00 7
#2011-10-25 19:00:00 3
You can use the endpoints function to find the rows that are the last rows of a time period ("hours", "days", "weeks", etc.)
endpoints(xobj, "days")
[1] 0 6 14 16
Or you can convert to a lower frequency
to.weekly(xobj)
# xobj.Open xobj.High xobj.Low xobj.Close
#2011-10-26 12 18 2 11
to.daily(xobj)
# xobj.Open xobj.High xobj.Low xobj.Close
#2011-10-25 12 18 2 2
#2011-10-26 4 12 2 9
#2011-10-26 13 13 11 11
Notice that the above creates columns for Open, High, Low, and Close. If you only want the data at the endpoints, you can use OHLC=FALSE
to.daily(xobj, OHLC=FALSE)
# [,1]
#2011-10-25 2
#2011-10-26 9
#2011-10-26 11
For more basic subsetting, and much more, visit http://www.quantmod.com/examples/
As @JoshuaUlrich mentions in the comments, split.xts is INCREDIBLY useful.
You can split by day (or week, or month, etc), apply a function, then recombine
split(xobj, 'days') #create a list where each element is the data for a different day
#[[1]]
# [,1]
#2011-10-24 01:00:00 12
#2011-10-24 02:00:00 4
#2011-10-24 19:00:00 18
#2011-10-24 20:00:00 7
#2011-10-24 21:00:00 4
#2011-10-24 22:00:00 2
#
#[[2]]
# [,1]
#2011-10-25 00:00:00 4
#2011-10-25 01:00:00 2
#2011-10-25 02:00:00 2
#2011-10-25 15:00:00 12
#2011-10-25 18:00:00 2
#2011-10-25 19:00:00 3
#2011-10-25 21:00:00 2
#2011-10-25 23:00:00 9
#
#[[3]]
# [,1]
#2011-10-26 00:00:00 13
#2011-10-26 01:00:00 11
Suppose you want only the first value of each day. split by day, lapply the first function and rbind back together.
do.call(rbind, lapply(split(xobj, 'days'), first))
# [,1]
#2011-10-24 01:00:00 12
#2011-10-25 00:00:00 4
#2011-10-26 00:00:00 13
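If you would rather not depend on xts, a base R sketch of the same "first value of each day" step is shown here. It assumes the rows are already sorted by time and uses a shortened, made-up copy of the data in UTC.

```r
dat <- data.frame(
  date = as.POSIXct(c("2011-10-24 01:00:00", "2011-10-24 02:00:00",
                      "2011-10-25 00:00:00", "2011-10-25 01:00:00",
                      "2011-10-26 00:00:00", "2011-10-26 01:00:00"),
                    format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
  obs = c(12, 4, 4, 2, 13, 11)
)

# duplicated() flags every repeat of a day, so negating it keeps
# only the first row seen for each calendar date.
day <- format(dat$date, "%Y-%m-%d")
first_per_day <- dat[!duplicated(day), ]
print(first_per_day)
```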