as.POSIXct gives inexplicable NA value [duplicate] - r

I have a large dataset (21683 records) and I've managed to combine date and time into a datetime correctly using as.POSIXct. However, this did not work for 6 records (17463:17468). This is the dataset I'm using:
> head(solar.angle)
Date Time sol.elev.angle ID Datetime
1 2016-11-24 15:00:00 41.32397 1 2016-11-24 15:00:00
2 2016-11-24 15:10:00 39.11225 2 2016-11-24 15:10:00
3 2016-11-24 15:20:00 36.88180 3 2016-11-24 15:20:00
4 2016-11-24 15:30:00 34.63507 4 2016-11-24 15:30:00
5 2016-11-24 15:40:00 32.37418 5 2016-11-24 15:40:00
6 2016-11-24 15:50:00 30.10096 6 2016-11-24 15:50:00
> solar.angle[17460:17470,]
Date Time sol.elev.angle ID Datetime
17488 2017-03-26 01:30:00 -72.01821 17460 2017-03-26 01:30:00
17489 2017-03-26 01:40:00 -69.53832 17461 2017-03-26 01:40:00
17490 2017-03-26 01:50:00 -67.05409 17462 2017-03-26 01:50:00
17491 2017-03-26 02:00:00 -64.56682 17463 <NA>
17492 2017-03-26 02:10:00 -62.07730 17464 <NA>
17493 2017-03-26 02:20:00 -59.58609 17465 <NA>
17494 2017-03-26 02:30:00 -57.09359 17466 <NA>
17495 2017-03-26 02:40:00 -54.60006 17467 <NA>
17496 2017-03-26 02:50:00 -52.10572 17468 <NA>
17497 2017-03-26 03:00:00 -49.61071 17469 2017-03-26 03:00:00
17498 2017-03-26 03:10:00 -47.11515 17470 2017-03-26 03:10:00
This is the code I'm using:
solar.angle$Datetime <- as.POSIXct(paste(solar.angle$Date,solar.angle$Time), format="%Y-%m-%d %H:%M:%S")
I've already tried to fill them in manually but this did not make any difference:
> solar.angle$Datetime[17463] <- as.POSIXct('2017-03-26 02:00:00', format = "%Y-%m-%d %H:%M:%S")
> solar.angle$Datetime[17463]
[1] NA
Any help will be appreciated!

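The problem here is that these times fall in the hour when clocks switch to summer time. In a time zone that follows European daylight saving rules (presumably your session's time zone, which as.POSIXct uses when no tz is given), the clock jumps from 02:00 straight to 03:00 on 2017-03-26, so local times from 02:00:00 to 02:59:59 do not exist and the conversion returns NA.
If you specify a time zone without daylight saving time, such as GMT, it will work: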
as.POSIXct('2017-03-26 02:00:00', format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
Which returns:
"2017-03-26 02:00:00 GMT"
You can check ?timezones for more information.
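To make the gap concrete, here is a minimal sketch that parses the same wall-clock string in two zones. "Europe/Paris" is only an assumed stand-in for the asker's session time zone, and the variable names are illustrative. The result for the non-existent local time depends on the platform and R version, so no output is claimed for it.

```r
# A wall-clock time inside the spring-forward gap of 2017-03-26.
x <- "2017-03-26 02:30:00"
fmt <- "%Y-%m-%d %H:%M:%S"

# Assumption: a zone with European DST rules, e.g. Europe/Paris.
# This local time does not exist there; depending on platform and R version
# the result may be NA (as in the question) or a shifted time.
local_try <- as.POSIXct(x, format = fmt, tz = "Europe/Paris")

# UTC has no daylight saving time, so every wall-clock time is valid.
utc_ok <- as.POSIXct(x, format = fmt, tz = "UTC")

print(local_try)
print(utc_ok)
```

Parsing in UTC (or GMT) is only appropriate if the recorded times really are free of daylight saving shifts.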

Related

na.approx function does not produce correct timestamps

I have a large dataset of electric load data with a missing timestamp for the last Sunday of March of each year due to daylight saving time. I have copied below a few rows containing a missing timestamp.
structure(list(Date_Time = structure(c(1427569200, 1427572800,
1427576400, 1427580000, 1427583600, 1427587200, NA, 1427590800,
1427594400, 1427598000, 1427601600, 1427605200), tzone = "EET", class = c("POSIXct",
"POSIXt")), Day_ahead_Load = c("7139", "6598", "6137", "5177",
"4728", "4628", "N/A", "4426", "4326", "4374", "4546", "4885"
), Actual_Load = c(6541, 6020, 5602, 5084, 4640, 4593, NA, 4353,
NA, NA, 4333, 4556)), row.names = c(NA, -12L), class = "data.frame")
#> Date_Time Day_ahead_Load Actual_Load
#> 1 2015-03-28 21:00:00 7139 6541
#> 2 2015-03-28 22:00:00 6598 6020
#> 3 2015-03-28 23:00:00 6137 5602
#> 4 2015-03-29 00:00:00 5177 5084
#> 5 2015-03-29 01:00:00 4728 4640
#> 6 2015-03-29 02:00:00 4628 4593
#> 7 <NA> N/A NA
#> 8 2015-03-29 04:00:00 4426 4353
#> 9 2015-03-29 05:00:00 4326 NA
#> 10 2015-03-29 06:00:00 4374 NA
#> 11 2015-03-29 07:00:00 4546 4333
#> 12 2015-03-29 08:00:00 4885 4556
I have tried to fill these missing timestamps using na.approx, but the function returns "2015-03-29 02:30:00", instead of "2015-03-29 03:00:00". It does not use the correct scale.
mydata$Date_Time <- as.POSIXct(na.approx(mydata$Date_Time), origin = "1970-01-01 00:00:00", tz = "EET")
#> Date_Time Day_ahead_Load Actual_Load
#> 1 2015-03-28 21:00:00 7139 6541
#> 2 2015-03-28 22:00:00 6598 6020
#> 3 2015-03-28 23:00:00 6137 5602
#> 4 2015-03-29 00:00:00 5177 5084
#> 5 2015-03-29 01:00:00 4728 4640
#> 6 2015-03-29 02:00:00 4628 4593
#> 7 2015-03-29 02:30:00 N/A NA
#> 8 2015-03-29 04:00:00 4426 4353
#> 9 2015-03-29 05:00:00 4326 NA
#> 10 2015-03-29 06:00:00 4374 NA
#> 11 2015-03-29 07:00:00 4546 4333
#> 12 2015-03-29 08:00:00 4885 4556
I have also tried using some other functions, such as "fill", but none of them works properly.
As I am fairly new to R, I would really appreciate any suggestions for filling the missing timestamps. Thank you in advance.
Actually, the result is correct. There is only a one-hour difference between the 6th and 8th rows because of the change from standard time to daylight saving time, so the interpolated midpoint is 02:30.
Use GMT (or equivalently UTC) if you intended there to be 2 hours between those rows. Below we keep the same date and time as a character string but change the time zone to GMT to avoid daylight saving time changes.
library(zoo) # provides na.approx
diff(mydata[c(6, 8), 1])
## Time difference of 1 hours
# use GMT
tt <- as.POSIXct(format(mydata[[1]]), tz = "GMT")
as.POSIXct(na.approx(tt), tz = "GMT", origin = "1970-01-01")
## [1] "2015-03-28 21:00:00 GMT" "2015-03-28 22:00:00 GMT"
## [3] "2015-03-28 23:00:00 GMT" "2015-03-29 00:00:00 GMT"
## [5] "2015-03-29 01:00:00 GMT" "2015-03-29 02:00:00 GMT"
## [7] "2015-03-29 03:00:00 GMT" "2015-03-29 04:00:00 GMT"
## [9] "2015-03-29 05:00:00 GMT" "2015-03-29 06:00:00 GMT"
## [11] "2015-03-29 07:00:00 GMT" "2015-03-29 08:00:00 GMT"
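The 02:30 result can be reproduced by hand from the epoch seconds in the dput() output. This is a small sketch using base R's approx as a stand-in for the linear interpolation that na.approx performs; the variable names are illustrative.

```r
# Epoch seconds of rows 6 and 8 from the dput() output in the question.
t6 <- 1427587200  # 2015-03-29 02:00:00 EET
t8 <- 1427590800  # 2015-03-29 04:00:00 EEST

# Only one hour of real time separates the two rows.
elapsed_hours <- (t8 - t6) / 3600

# Linear interpolation at row 7, halfway between rows 6 and 8.
t7 <- approx(x = c(6, 8), y = c(t6, t8), xout = 7)$y

print(elapsed_hours)
# The midpoint shown in UTC; in EET (UTC+2 before the switch) this is 02:30.
print(format(as.POSIXct(t7, origin = "1970-01-01", tz = "UTC"), "%Y-%m-%d %H:%M:%S"))
```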
You could use the following loop, which should give the expected timestamps even if several NAs follow each other in the data (here dat is the data frame from the question).
library(lubridate)
dat$Date_Time <- as_datetime(as.character(dat$Date_Time))
dat$id <- 1:nrow(dat)
dat$previoustime <- NA
dat$timediff <- NA
for (i in 2:nrow(dat)) {
  # rows before i that have a non-NA timestamp
  previousdateinds <- which(!is.na(dat$Date_Time) & dat$id < i)
  previousdateind <- tail(previousdateinds, 1)
  dat$timediff[i] <- i - previousdateind # number of rows between this row and the last non-NA time
  dat$previoustime[i] <- as.character(dat$Date_Time)[previousdateind]
}
dat$previoustime <- as_datetime(dat$previoustime)
dat$result <- ifelse(is.na(dat$Date_Time), as.character(dat$previoustime+dat$timediff*60*60),
as.character(dat$Date_Time))
dat[6:8,]
Date_Time Day_ahead_Load Actual_Load id previoustime timediff result
6 2015-03-29 02:00:00 4628 4593 6 2015-03-29 01:00:00 1 2015-03-29 02:00:00
7 <NA> N/A NA 7 2015-03-29 02:00:00 1 2015-03-29 03:00:00
8 2015-03-29 04:00:00 4426 4353 8 2015-03-29 02:00:00 2 2015-03-29 04:00:00
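If the rows are known to be strictly hourly with none missing, a shorter alternative to the loop is to regenerate the whole index. This is a sketch under that assumption, not the answerer's method; it treats the wall-clock strings as UTC (as the loop does via as_datetime) and uses illustrative variable names.

```r
# Wall-clock strings with one missing entry (rows 5 to 8 of the example data).
wall <- c("2015-03-29 01:00:00", "2015-03-29 02:00:00", NA, "2015-03-29 04:00:00")

# Assumption: rows are strictly hourly and the first entry is not NA.
# Parsing in UTC keeps 03:00 representable, because UTC has no DST gap.
first <- as.POSIXct(wall[1], format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
filled <- seq(from = first, by = "hour", length.out = length(wall))
filled_txt <- format(filled, "%Y-%m-%d %H:%M:%S")

# Sanity check: the regenerated index must agree with every known timestamp.
known <- !is.na(wall)
stopifnot(all(filled_txt[known] == wall[known]))

print(filled_txt)
```

The sanity check fails loudly if the data are not actually regular, which is the case where the loop above remains the safer choice.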

How to combine Date from one column and Time from another?

I imported some data from Excel that has separate columns for "Date" and "Time". When I imported the "Time" column, it returned with 1899-12-31 19:00:00 with the date 1899-12-31 for every single time value.
I would like to create a new column that would combine the date from the "Date" column and time from the "Time" column so I can do some calculations.
# A tibble: 207 x 2
DoS ToS
<dttm> <dttm>
1 2018-01-27 00:00:00 1899-12-31 19:00:00
2 2018-02-07 00:00:00 1899-12-31 15:45:00
3 2018-02-13 00:00:00 1899-12-31 23:00:00
4 2018-02-15 00:00:00 1899-12-31 13:45:00
5 2018-02-16 00:00:00 1899-12-31 10:00:00
6 2018-02-19 00:00:00 1899-12-31 15:00:00
7 2018-02-20 00:00:00 1899-12-31 15:05:00
8 2018-02-21 00:00:00 1899-12-31 15:00:00
> dput(head(sample, 10))
structure(list(DoS = structure(c(1517011200, 1517961600, 1518480000,
1518652800, 1518739200, 1518998400, 1519084800, 1519171200, 1519257600,
1519862400), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
ToS = structure(c(-2209006800, -2209018500, -2208992400,
-2209025700, -2209039200, -2209021200, -2209020900, -2209021200,
-2209033800, -2209005000), class = c("POSIXct", "POSIXt"), tzone = "UTC")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
Is there some way I can extract the time values and paste it to the Date column?
Using base R, we can extract date from DoS and time from ToS and combine them together.
transform(sample, Datetime = as.POSIXct(paste(as.Date(DoS), format(ToS, "%T"))))
# DoS ToS Datetime
#1 2018-01-27 1899-12-31 19:00:00 2018-01-27 19:00:00
#2 2018-02-07 1899-12-31 15:45:00 2018-02-07 15:45:00
#3 2018-02-13 1899-12-31 23:00:00 2018-02-13 23:00:00
#4 2018-02-15 1899-12-31 13:45:00 2018-02-15 13:45:00
#5 2018-02-16 1899-12-31 10:00:00 2018-02-16 10:00:00
#6 2018-02-19 1899-12-31 15:00:00 2018-02-19 15:00:00
#7 2018-02-20 1899-12-31 15:05:00 2018-02-20 15:05:00
#8 2018-02-21 1899-12-31 15:00:00 2018-02-21 15:00:00
#9 2018-02-22 1899-12-31 11:30:00 2018-02-22 11:30:00
#10 2018-03-01 1899-12-31 19:30:00 2018-03-01 19:30:00
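Note that as.POSIXct without a tz argument uses the session's time zone, which may differ from the UTC tzone of the imported columns. A minimal sketch with an explicit time zone and format, using the first two rows of the example data as standalone vectors (an illustration, not the asker's tibble):

```r
# First two rows of the example, both stored in UTC as in the dput() output.
DoS <- as.POSIXct(c("2018-01-27", "2018-02-07"), tz = "UTC")
ToS <- as.POSIXct(c("1899-12-31 19:00:00", "1899-12-31 15:45:00"), tz = "UTC")

# Take the calendar date from DoS and the clock time from ToS,
# then parse the combined string with an explicit format and time zone.
combined_txt <- paste(as.Date(DoS), format(ToS, "%H:%M:%S"))
Datetime <- as.POSIXct(combined_txt, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

print(Datetime)
```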

How to transform a datetime column from a `Non UTC` format to `UTC` format without losing data on the days when there is a time change in R

I have a data frame df1 with a datetime column in UTC. I need to merge this data frame with the data frame df2 by the datetime column. My problem is that df2 is in Europe/Paris time, and when I convert df2$datetime from Europe/Paris to UTC, I lose or duplicate data at the moments of the time change (summer to winter or winter to summer). As an example:
df1<- data.frame(datetime=c("2016-10-29 22:00:00","2016-10-29 23:00:00","2016-10-30 00:00:00","2016-10-30 01:00:00","2016-10-30 02:00:00","2016-10-30 03:00:00","2016-10-30 04:00:00","2016-10-30 05:00:00","2017-03-25 22:00:00","2017-03-25 23:00:00","2017-03-26 00:00:00","2017-03-26 01:00:00","2017-03-26 02:00:00","2017-03-26 03:00:00","2017-03-26 04:00:00"), Var1= c(4, 56, 76, 54, 34, 3, 4, 6, 78, 23, 12, 3, 5, 6, 7))
df1$datetime<- as.POSIXct(df1$datetime, format = "%Y-%m-%d %H", tz= "UTC")
df2<- data.frame(datetime=c("2016-10-29 22:00:00","2016-10-29 23:00:00","2016-10-30 00:00:00","2016-10-30 01:00:00","2016-10-30 02:00:00","2016-10-30 03:00:00","2016-10-30 04:00:00","2016-10-30 05:00:00","2017-03-25 22:00:00","2017-03-25 23:00:00","2017-03-26 00:00:00","2017-03-26 01:00:00","2017-03-26 02:00:00","2017-03-26 03:00:00","2017-03-26 04:00:00"), Var2=c(56, 43, 23, 14, 51, 27, 89, 76, 56, 4, 35, 23, 4, 62, 84))
df2$datetime<- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz= "Europe/Paris")
df1
datetime Var1
1 2016-10-29 22:00:00 4
2 2016-10-29 23:00:00 56
3 2016-10-30 00:00:00 76
4 2016-10-30 01:00:00 54
5 2016-10-30 02:00:00 34
6 2016-10-30 03:00:00 3
7 2016-10-30 04:00:00 4
8 2016-10-30 05:00:00 6
9 2017-03-25 22:00:00 78
10 2017-03-25 23:00:00 23
11 2017-03-26 00:00:00 12
12 2017-03-26 01:00:00 3
13 2017-03-26 02:00:00 5
14 2017-03-26 03:00:00 6
15 2017-03-26 04:00:00 7
df2
datetime Var2
1 2016-10-29 22:00:00 56
2 2016-10-29 23:00:00 43
3 2016-10-30 00:00:00 23
4 2016-10-30 01:00:00 14
5 2016-10-30 02:00:00 51
6 2016-10-30 03:00:00 27
7 2016-10-30 04:00:00 89
8 2016-10-30 05:00:00 76
9 2017-03-25 22:00:00 56
10 2017-03-25 23:00:00 4
11 2017-03-26 00:00:00 35
12 2017-03-26 01:00:00 23
13 2017-03-26 02:00:00 4
14 2017-03-26 03:00:00 62
15 2017-03-26 04:00:00 84
When I change df2$datetime format from Europe/Paris to UTC, this happens:
library(lubridate)
df2$datetime<-with_tz(df2$datetime,"UTC")
df2
datetime Var2
1 2016-10-29 20:00:00 56
2 2016-10-29 21:00:00 43
3 2016-10-29 22:00:00 23
4 2016-10-29 23:00:00 14
5 2016-10-30 00:00:00 51
6 2016-10-30 02:00:00 27 # Data at 01:00:00 is missing
7 2016-10-30 03:00:00 89
8 2016-10-30 04:00:00 76
9 2017-03-25 21:00:00 56
10 2017-03-25 22:00:00 4
11 2017-03-25 23:00:00 35
12 2017-03-26 00:00:00 23
13 2017-03-26 00:00:00 4 # There is a duplicate at 00:00:00
14 2017-03-26 01:00:00 62
15 2017-03-26 02:00:00 84
16 2017-03-26 03:00:00 56
Is there another way to transform df2$datetime from Europe/Paris format to UTC format that allows me to merge two data frames without this problem of having either lost or duplicated data? I don't understand why I have to lose or duplicate info in df2.
Is the transformation I applied to df2$datetime the right one for merging this data frame with df1? What I've done so far to work around this is to add a new row to df2 on 2016-10-30 at 01:00:00, whose value is the mean of 2016-10-30 00:00:00 and 2016-10-30 02:00:00, and to remove one row on 2017-03-26 at 00:00:00.
Thanks for your help.
I found out that my original df2 should be like this:
df2
datetime Var1
1 2016-10-29 22:00:00 4 # This is time in format "GMT+2". It corresponds to 20:00 UTC
2 2016-10-29 23:00:00 56 # This is time in format "GMT+2". It corresponds to 21:00 UTC
3 2016-10-30 00:00:00 76 # This is time in format "GMT+2". It corresponds to 22:00 UTC
4 2016-10-30 01:00:00 54 # This is time in format "GMT+2". It corresponds to 23:00 UTC
5 2016-10-30 02:00:00 34 # This is time in format "GMT+2". It corresponds to 00:00 UTC
6 2016-10-30 02:00:00 3 # This is time in format "GMT+1". It corresponds to 01:00 UTC
7 2016-10-30 03:00:00 4 # This is time in format "GMT+1". It corresponds to 02:00 UTC
8 2016-10-30 04:00:00 6 # This is time in format "GMT+1". It corresponds to 03:00 UTC
9 2016-10-30 05:00:00 78 # This is time in format "GMT+1". It corresponds to 04:00 UTC
10 2017-03-25 22:00:00 23 # This is time in format "GMT+1". It corresponds to 21:00 UTC
11 2017-03-25 23:00:00 12 # This is time in format "GMT+1". It corresponds to 22:00 UTC
12 2017-03-26 00:00:00 3 # This is time in format "GMT+1". It corresponds to 23:00 UTC
13 2017-03-26 01:00:00 5 # This is time in format "GMT+1". It corresponds to 00:00 UTC
14 2017-03-26 03:00:00 6 # This is time in format "GMT+2". It corresponds to 01:00 UTC
15 2017-03-26 04:00:00 7 # This is time in format "GMT+2". It corresponds to 02:00 UTC
16 2017-03-26 05:00:00 76 # This is time in format "GMT+2". It corresponds to 03:00 UTC
However, my original df2 doesn't have duplicated or lost time data. It is like this:
df2
datetime Var1
1 2016-10-29 22:00:00 4
2 2016-10-29 23:00:00 56
3 2016-10-30 00:00:00 76
4 2016-10-30 01:00:00 54
5 2016-10-30 02:00:00 34
6 2016-10-30 03:00:00 3
7 2016-10-30 04:00:00 4
8 2016-10-30 05:00:00 6
9 2017-03-25 22:00:00 78
10 2017-03-25 23:00:00 23
11 2017-03-26 00:00:00 12
12 2017-03-26 01:00:00 3
13 2017-03-26 02:00:00 5
14 2017-03-26 03:00:00 6
15 2017-03-26 04:00:00 7
16 2017-03-26 05:00:00 76
When I applied the R code df2$datetime<-with_tz(df2$datetime,"UTC"), this happens:
df2
datetime Var1
1 2016-10-29 20:00:00 4
2 2016-10-29 21:00:00 56
3 2016-10-29 22:00:00 76
4 2016-10-29 23:00:00 54
5 2016-10-30 00:00:00 34
6 2016-10-30 02:00:00 3 # I have to manually add a new row between the times "00:00" and "02:00"
7 2016-10-30 03:00:00 4
8 2016-10-30 04:00:00 6
9 2017-03-25 21:00:00 78
10 2017-03-25 22:00:00 23
11 2017-03-25 23:00:00 12
12 2017-03-26 00:00:00 3
13 2017-03-26 01:00:00 5 # I have to manually remove one of the rows referring to the time "01:00".
14 2017-03-26 01:00:00 6
15 2017-03-26 02:00:00 7
16 2017-03-26 03:00:00 76
If my original df2 had one duplicated time at "02:00:00" on 30th October and a gap on 26th March between "01:00" and "03:00", the R code df2$datetime<-with_tz(df2$datetime,"UTC") would give this:
df2
datetime Var1
1 2016-10-29 20:00:00 4
2 2016-10-29 21:00:00 56
3 2016-10-29 22:00:00 76
4 2016-10-29 23:00:00 54
5 2016-10-30 00:00:00 34
6 2016-10-30 00:00:00 3 # I just have to change "00:00:00" for "01:00"
7 2016-10-30 02:00:00 4
8 2016-10-30 03:00:00 6
9 2016-10-30 04:00:00 78
10 2017-03-25 21:00:00 23
11 2017-03-25 22:00:00 12
12 2017-03-25 23:00:00 3
13 2017-03-26 00:00:00 5
14 2017-03-26 01:00:00 6
15 2017-03-26 02:00:00 7
16 2017-03-26 03:00:00 76
# As there are several versions of df2, I use the one shown in the question
df2 <- read.table(text = "
datetime Var2
1 '2016-10-29 22:00:00' 56
2 '2016-10-29 23:00:00' 43
3 '2016-10-30 00:00:00' 23
4 '2016-10-30 01:00:00' 14
5 '2016-10-30 02:00:00' 51
6 '2016-10-30 03:00:00' 27
7 '2016-10-30 04:00:00' 89
8 '2016-10-30 05:00:00' 76
9 '2017-03-25 22:00:00' 56
10 '2017-03-25 23:00:00' 4
11 '2017-03-26 00:00:00' 35
12 '2017-03-26 01:00:00' 23
13 '2017-03-26 02:00:00' 4
14 '2017-03-26 03:00:00' 62
15 '2017-03-26 04:00:00' 84
", header = TRUE)
library(lubridate)
# As soon as you assign the time zone, the content of df2 is already changed
df2$datetimeEP <- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz= "Europe/Paris")
#df2[13,]
# datetime Var2 datetimeEP
#13 2017-03-26 02:00:00 4 2017-03-26 01:00:00
# It looks to me as if your recorded times do not observe daylight saving time.
# So you have to use e.g. "Etc/GMT-1" (a fixed UTC+1 offset) instead of "Europe/Paris"
df2$datetimeG1 <- as.POSIXct(df2$datetime, format = "%Y-%m-%d %H", tz= "Etc/GMT-1")
data.frame(datetime=df2$datetime, utc=with_tz(df2$datetimeG1,"UTC"))
# datetime utc
#1 2016-10-29 22:00:00 2016-10-29 21:00:00
#2 2016-10-29 23:00:00 2016-10-29 22:00:00
#3 2016-10-30 00:00:00 2016-10-29 23:00:00
#4 2016-10-30 01:00:00 2016-10-30 00:00:00
#5 2016-10-30 02:00:00 2016-10-30 01:00:00
#6 2016-10-30 03:00:00 2016-10-30 02:00:00
#7 2016-10-30 04:00:00 2016-10-30 03:00:00
#8 2016-10-30 05:00:00 2016-10-30 04:00:00
#9 2017-03-25 22:00:00 2017-03-25 21:00:00
#10 2017-03-25 23:00:00 2017-03-25 22:00:00
#11 2017-03-26 00:00:00 2017-03-25 23:00:00
#12 2017-03-26 01:00:00 2017-03-26 00:00:00
#13 2017-03-26 02:00:00 2017-03-26 01:00:00
#14 2017-03-26 03:00:00 2017-03-26 02:00:00
#15 2017-03-26 04:00:00 2017-03-26 03:00:00
# You can use "dst" to check whether a date-time in a given time zone falls in daylight saving time
dst(df2$datetimeEP)
dst(df2$datetimeG1)
dst(with_tz(df2$datetimeEP,"UTC"))
dst(with_tz(df2$datetimeG1,"UTC"))
# If your recorded times do observe daylight saving time, then you HAVE a gap and an overlap.
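The sign convention of the Etc/GMT zones is easy to get backwards: "Etc/GMT-1" is one hour ahead of UTC and never shifts for daylight saving time. Here is a minimal base R sketch, assuming (as the answer does) that the recorded wall-clock times never changed with DST; the variable names are illustrative.

```r
# The two wall-clock times that are problematic in Europe/Paris.
local_txt <- c("2016-10-30 02:00:00", "2017-03-26 02:00:00")
fmt <- "%Y-%m-%d %H:%M:%S"

# Assumption: the recorder stayed on a fixed UTC+1 clock all year.
# "Etc/GMT-1" means UTC+1 (the sign in these zone names is inverted).
fixed <- as.POSIXct(local_txt, format = fmt, tz = "Etc/GMT-1")

# Display the same instants in UTC: one hour earlier, no gap, no duplicate.
utc_txt <- format(fixed, fmt, tz = "UTC")
print(utc_txt)
# [1] "2016-10-30 01:00:00" "2017-03-26 01:00:00"
```

With a fixed offset every wall-clock time maps to exactly one UTC instant, which is what a merge on the datetime column needs.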

Erase space in splitting - R

I have a dataframe where I split the datetime column into date and time (two columns). However, when I group by time it gives me duplicate time values. To analyze this I ran table() on the time column, and it showed the duplicates as well. This is a sample of it:
> table(df$time)
00:00:00 00:00:00 00:15:00 00:15:00 00:30:00 00:30:00
2211 1047 2211 1047 2211 1047
As you can see, after the split one of the "unique" values kept a " " inside. Is there an easy way to solve this?
PS: The datatype of the time column is character.
EDIT: Code added
df$datetime <- as.character.Date(df$datetime)
x <- colsplit(df$datetime, ' ', names = c('Date','Time'))
df <- cbind(df, x)
There are a number of approaches. One of them is to use appropriate functions to extract Dates and Times from Datetime column:
df <- data.frame(datetime = seq(
from=as.POSIXct("2018-5-15 0:00", tz="UTC"),
to=as.POSIXct("2018-5-16 24:00", tz="UTC"),
by="30 min") )
head(df$datetime)
#[1] "2018-05-15 00:00:00 UTC" "2018-05-15 00:30:00 UTC" "2018-05-15 01:00:00 UTC" "2018-05-15 01:30:00 UTC"
#[5] "2018-05-15 02:00:00 UTC" "2018-05-15 02:30:00 UTC"
df$Date <- as.Date(df$datetime)
df$Time <- format(df$datetime,"%H:%M:%S")
head(df)
# datetime Date Time
# 1 2018-05-15 00:00:00 2018-05-15 00:00:00
# 2 2018-05-15 00:30:00 2018-05-15 00:30:00
# 3 2018-05-15 01:00:00 2018-05-15 01:00:00
# 4 2018-05-15 01:30:00 2018-05-15 01:30:00
# 5 2018-05-15 02:00:00 2018-05-15 02:00:00
# 6 2018-05-15 02:30:00 2018-05-15 02:30:00
table(df$Time)
#00:00:00 00:30:00 01:00:00 01:30:00 02:00:00 02:30:00 03:00:00 03:30:00 04:00:00 04:30:00 05:00:00 05:30:00
#3 2 2 2 2 2 2 2 2 2 2 2
#06:00:00 06:30:00 07:00:00 07:30:00 08:00:00 08:30:00 09:00:00 09:30:00 10:00:00 10:30:00 11:00:00 11:30:00
#2 2 2 2 2 2 2 2 2 2 2 2
#12:00:00 12:30:00 13:00:00 13:30:00 14:00:00 14:30:00 15:00:00 15:30:00 16:00:00 16:30:00 17:00:00 17:30:00
#2 2 2 2 2 2 2 2 2 2 2 2
#18:00:00 18:30:00 19:00:00 19:30:00 20:00:00 20:30:00 21:00:00 21:30:00 22:00:00 22:30:00 23:00:00 23:30:00
#2 2 2 2 2 2 2 2 2 2 2 2
#If the data were given as character strings and contain extra spaces the above approach will still work
df <- data.frame(datetime=c("2018-05-15 00:00:00","2018-05-15 00:30:00",
"2018-05-15 01:00:00", "2018-05-15 02:00:00",
"2018-05-15 00:00:00","2018-05-15 00:30:00"),
stringsAsFactors=FALSE)
df$Date <- as.Date(df$datetime)
df$Time <- format(as.POSIXct(df$datetime, tz="UTC"),"%H:%M:%S")
head(df)
# datetime Date Time
# 1 2018-05-15 00:00:00 2018-05-15 00:00:00
# 2 2018-05-15 00:30:00 2018-05-15 00:30:00
# 3 2018-05-15 01:00:00 2018-05-15 01:00:00
# 4 2018-05-15 02:00:00 2018-05-15 02:00:00
# 5 2018-05-15 00:00:00 2018-05-15 00:00:00
# 6 2018-05-15 00:30:00 2018-05-15 00:30:00
table(df$Time)
#00:00:00 00:30:00 01:00:00 02:00:00
# 2 2 1 1
reshape2::colsplit accepts regular expressions, so you could split on '\s+' which matches 1 or more whitespace characters.
You can find out more about regular expressions in R using ?base::regex. The syntax is generally constant between languages, so you can use pretty much any regex tutorial. Take a look at https://regex101.com/. This site evaluates your regular expressions in real time and shows you exactly what each part is matching. It is extremely helpful!
Keep in mind that inside an R string literal you must double each backslash \. So \s (to match 1 whitespace character) must be written as \\s in R.
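As a small illustration of the doubled backslash, the sketch below splits on runs of whitespace with base R's strsplit, which is swapped in here for colsplit so the example has no package dependency. The input strings are made up; the second one contains two spaces.

```r
# Made-up datetimes; the second one has two spaces between date and time.
datetime <- c("2018-05-15 00:00:00", "2018-05-15  00:00:00", "2018-05-15 00:15:00")

# "\\s+" is the regular expression \s+ : one or more whitespace characters.
parts <- strsplit(datetime, "\\s+")

# Second piece of every split is the time, with no stray leading space.
time <- vapply(parts, `[`, character(1), 2)

print(table(time))
```

Calling trimws() on an already split column is another way to remove the stray space.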

Partitioning data set by time intervals in R

I have some observed data by hour. I am trying to subset this data by the day or even week intervals. I am not sure how to proceed with this task in R.
The sample of the data is below.
date obs
2011-10-24 01:00:00 12
2011-10-24 02:00:00 4
2011-10-24 19:00:00 18
2011-10-24 20:00:00 7
2011-10-24 21:00:00 4
2011-10-24 22:00:00 2
2011-10-25 00:00:00 4
2011-10-25 01:00:00 2
2011-10-25 02:00:00 2
2011-10-25 15:00:00 12
2011-10-25 18:00:00 2
2011-10-25 19:00:00 3
2011-10-25 21:00:00 2
2011-10-25 23:00:00 9
2011-10-26 00:00:00 13
2011-10-26 01:00:00 11
First I entered the data with the multiple spaces replaced with tabs.
dat$date <- as.POSIXct(dat$date, format="%Y-%m-%d %H:%M:%S")
split(dat , as.POSIXlt(dat$date)$yday)
# Notice these are not the same functions
#---------------------
$`296`
date obs
1 2011-10-24 01:00:00 12
2 2011-10-24 02:00:00 4
3 2011-10-24 19:00:00 18
4 2011-10-24 20:00:00 7
5 2011-10-24 21:00:00 4
6 2011-10-24 22:00:00 2
$`297`
date obs
7 2011-10-25 00:00:00 4
8 2011-10-25 01:00:00 2
9 2011-10-25 02:00:00 2
10 2011-10-25 15:00:00 12
11 2011-10-25 18:00:00 2
12 2011-10-25 19:00:00 3
13 2011-10-25 21:00:00 2
14 2011-10-25 23:00:00 9
$`298`
date obs
15 2011-10-26 00:00:00 13
16 2011-10-26 01:00:00 11
The POSIXlt class does not work well inside data frames, but it can be very handy for creating time-based groups. It's a list structure with these components: 'yday', 'wday', 'year', 'mon', 'mday', 'hour', 'min', 'sec' and 'isdst'. The cut.POSIXt function adds divisions at other natural boundaries, e.g.
?cut.POSIXt
split(dat , cut(dat$date, "week") )
If you wanted to sum within date:
tapply(dat$obs, as.POSIXlt(dat$date)$yday, sum)
#-------
296 297 298
47 36 24
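One caveat with yday is that it restarts every year, so rows from the same day of different years would land in the same group. A sketch of the same grouping keyed on the calendar date instead, using a shortened copy of the sample data (the time zone "UTC" is an assumption made to keep the example reproducible):

```r
# Shortened copy of the sample data; tz = "UTC" is assumed for reproducibility.
dat <- data.frame(
  date = as.POSIXct(c("2011-10-24 01:00:00", "2011-10-24 19:00:00",
                      "2011-10-25 00:00:00", "2011-10-26 01:00:00"), tz = "UTC"),
  obs = c(12, 18, 4, 11)
)

# Group by calendar date (unique across years, unlike yday).
day <- format(dat$date, "%Y-%m-%d")
daily_sum <- tapply(dat$obs, day, sum)

# Group by week; cut() starts weeks on Monday by default.
week <- cut(dat$date, "week")
weekly_sum <- tapply(dat$obs, week, sum)

print(daily_sum)
print(weekly_sum)
```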
I'd use a time series class such as xts:
library(xts)
dat <- read.table(text="2011-10-24 01:00:00 12
2011-10-24 02:00:00 4
2011-10-24 19:00:00 18
2011-10-24 20:00:00 7
2011-10-24 21:00:00 4
2011-10-24 22:00:00 2
2011-10-25 00:00:00 4
2011-10-25 01:00:00 2
2011-10-25 02:00:00 2
2011-10-25 15:00:00 12
2011-10-25 18:00:00 2
2011-10-25 19:00:00 3
2011-10-25 21:00:00 2
2011-10-25 23:00:00 9
2011-10-26 00:00:00 13
2011-10-26 01:00:00 11", header=FALSE, stringsAsFactors=FALSE)
xobj <- xts(dat[, 3], as.POSIXct(paste(dat[, 1], dat[, 2])))
xts subsetting is very intuitive. For all data on "2011-10-25", do this
xobj["2011-10-25"]
# [,1]
#2011-10-25 00:00:00 4
#2011-10-25 01:00:00 2
#2011-10-25 02:00:00 2
#2011-10-25 15:00:00 12
#2011-10-25 18:00:00 2
#2011-10-25 19:00:00 3
#2011-10-25 21:00:00 2
#2011-10-25 23:00:00 9
You can also subset out time spans like this (all data between and including 2011-10-24 and 2011-10-25)
xobj["2011-10-24/2011-10-25"]
Or, if you want all data from October 2011,
xobj["2011-10"]
If you want to get all data from any day that is between 19:00 and 20:00,
xobj['T19:00:00/T20:00:00']
# [,1]
#2011-10-24 19:00:00 18
#2011-10-24 20:00:00 7
#2011-10-25 19:00:00 3
You can use the endpoints function to find the rows that are the last rows of a time period ("hours", "days", "weeks", etc.)
endpoints(xobj, "days")
[1] 0 6 14 16
Or you can convert to a lower frequency
to.weekly(xobj)
# xobj.Open xobj.High xobj.Low xobj.Close
#2011-10-26 12 18 2 11
to.daily(xobj)
# xobj.Open xobj.High xobj.Low xobj.Close
#2011-10-25 12 18 2 2
#2011-10-26 4 12 2 9
#2011-10-26 13 13 11 11
Notice that the above creates columns for Open, High, Low, and Close. If you only want the data at the endpoints, you can use OHLC=FALSE
to.daily(xobj, OHLC=FALSE)
# [,1]
#2011-10-25 2
#2011-10-26 9
#2011-10-26 11
For more basic subsetting, and much more, visit http://www.quantmod.com/examples/
As @JoshuaUlrich mentions in the comments, split.xts is INCREDIBLY useful.
You can split by day (or week, or month, etc), apply a function, then recombine
split(xobj, 'days') #create a list where each element is the data for a different day
#[[1]]
# [,1]
#2011-10-24 01:00:00 12
#2011-10-24 02:00:00 4
#2011-10-24 19:00:00 18
#2011-10-24 20:00:00 7
#2011-10-24 21:00:00 4
#2011-10-24 22:00:00 2
#
#[[2]]
# [,1]
#2011-10-25 00:00:00 4
#2011-10-25 01:00:00 2
#2011-10-25 02:00:00 2
#2011-10-25 15:00:00 12
#2011-10-25 18:00:00 2
#2011-10-25 19:00:00 3
#2011-10-25 21:00:00 2
#2011-10-25 23:00:00 9
#
#[[3]]
# [,1]
#2011-10-26 00:00:00 13
#2011-10-26 01:00:00 11
Suppose you want only the first value of each day. Split by day, lapply the first function, and rbind the pieces back together.
do.call(rbind, lapply(split(xobj, 'days'), first))
# [,1]
#2011-10-24 01:00:00 12
#2011-10-25 00:00:00 4
#2011-10-26 00:00:00 13
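For comparison, the same "first value of each day" result can be sketched in base R on a plain data frame. This is an alternative to the xts approach, not part of it, and it assumes the rows are already sorted by time; the data are a shortened copy of the sample with tz = "UTC" assumed.

```r
# Shortened copy of the sample data, already sorted by time (assumption).
dat <- data.frame(
  date = as.POSIXct(c("2011-10-24 01:00:00", "2011-10-24 02:00:00",
                      "2011-10-25 00:00:00", "2011-10-25 01:00:00",
                      "2011-10-26 00:00:00"), tz = "UTC"),
  obs = c(12, 4, 4, 2, 13)
)

# Calendar day of each row; duplicated() marks every row after the first per day.
day <- format(dat$date, "%Y-%m-%d")
first_per_day <- dat[!duplicated(day), ]

print(first_per_day)
```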
