I have 3848 rows of POSIXct data: stop times of bike trips during the month of April. As the output below shows, all of the data is in POSIXct format and falls within April.
length(output2_stoptime)
[1] 3848
head(output2_stoptime)
[1] "2015-04-01 17:19:27 EST" "2015-04-02 07:26:06 EST" "2015-04-08 10:09:37 EST"
[4] "2015-04-12 20:08:00 EST" "2015-04-13 17:53:11 EST" "2015-04-14 07:17:34 EST"
class(output2_stoptime)
[1] "POSIXct" "POSIXt"
range(output2_stoptime)
[1] "2015-04-01 00:34:29 EST" "2015-04-30 20:49:22 EST"
Sys.timezone()
[1] "EST"
However, when I try converting this into a table of stop times per day, I get 4 dates that are converted to the 1st of May. I thought this might be caused by a system timezone different from the data's, as I am located in Europe at the moment, but even after setting the timezone to EST the problem persists. For example:
by_day_output2 = as.data.frame(as.Date(output2_stoptime), tz = "EST")
colnames(by_day_output2)[1] = "SUM"
movements_Apr = as.data.frame(table(by_day_output2$SUM))
colnames(movements_Apr)[1] = "DATE"
tail(movements_Apr)
DATE Freq
26 2015-04-26 96
27 2015-04-27 125
28 2015-04-28 145
29 2015-04-29 151
30 2015-04-30 99
31 2015-05-01 4
Why are the four dates converting improperly when the time zones of the data and the system match? None of the data falls within May.
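(A likely explanation, sketched below with the same variable names: in the call above, tz = "EST" is an argument to as.data.frame(), which silently ignores it, not to as.Date(). as.Date.POSIXct() defaults to tz = "UTC", so any stop time at or after 19:00 EST lands on the next UTC calendar day, which would explain a handful of late April 30 stops appearing as May 1.)
# Pass tz to as.Date() itself; as.Date.POSIXct() otherwise assumes UTC,
# which pushes any stop time after 19:00 EST onto the next calendar day.
by_day_output2 = as.data.frame(as.Date(output2_stoptime, tz = "EST"))
colnames(by_day_output2)[1] = "SUM"
movements_Apr = as.data.frame(table(by_day_output2$SUM))
colnames(movements_Apr)[1] = "DATE"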
Suppose there is a csv file named ta_sample.csv as under:
"BILL_DT","AMOUNT"
"2015-07-27T18:30:00Z",16000
"2015-07-07T18:30:00Z",6110
"2015-07-26T18:30:00Z",250
"2015-07-22T18:30:00Z",1000
"2015-07-06T18:30:00Z",2640000
Reading the above using read_csv_arrow and customizing the column types (which is almost always needed with actual production data):
library(arrow)
read_csv_arrow(
  "ta_sample.csv",
  col_names = c("BILL_DT", "AMOUNT"),
  col_types = "td",
  skip = 1,
  timestamp_parsers = c("%Y-%m-%dT%H:%M:%SZ"))
the result is as follows:
# A tibble: 5 x 2
BILL_DT AMOUNT
<dttm> <dbl>
1 2015-07-28 00:00:00 16000
2 2015-07-08 00:00:00 6110
3 2015-07-27 00:00:00 250
4 2015-07-23 00:00:00 1000
5 2015-07-07 00:00:00 2640000
The issue here is that the dates are increased by one day and the time disappears. It is worth mentioning that both data.table::fread() and readr::read_csv() read it properly, e.g.,
library(readr)
read_csv("ta_sample.csv")
# A tibble: 5 x 2
BILL_DT AMOUNT
<dttm> <dbl>
1 2015-07-27 18:30:00 16000
2 2015-07-07 18:30:00 6110
3 2015-07-26 18:30:00 250
4 2015-07-22 18:30:00 1000
5 2015-07-06 18:30:00 2640000
Parsing example values from the BILL_DT column with strptime() also works perfectly:
strptime(c("2015-07-27T18:30:00Z", "2015-07-07T18:30:00Z"), "%Y-%m-%dT%H:%M:%SZ")
[1] "2015-07-27 18:30:00 IST" "2015-07-07 18:30:00 IST"
What parameters in read_csv_arrow need to be adjusted to get results identical to those given by readr::read_csv()?
There are a few things going on here, but they all relate to timezones and how they are interpreted by various parts of R, Arrow, and other packages.
When Arrow reads in timestamps, it treats the values as if they were UTC. Arrow does not yet have the ability to specify alternative timezones when parsing[1], so it stores these values as timezoneless (and assumes UTC). In this case, since the timestamps you have are UTC (according to ISO 8601, the Z at the end means UTC), they are stored correctly in Arrow as timezoneless UTC timestamps. The values of the timestamps are the same (that is, they represent the same instants in UTC); the difference is only in how they are displayed: as the time in UTC, or in the local timezone.
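A quick way to see this timezonelessness (a sketch, reusing the same read call but with as_data_frame = FALSE) is to inspect the Arrow schema, where BILL_DT shows up as a plain timestamp with no timezone qualifier:
tab <- read_csv_arrow(
  "ta_sample.csv",
  col_names = c("BILL_DT", "AMOUNT"),
  col_types = "td",
  skip = 1,
  timestamp_parsers = c("%Y-%m-%dT%H:%M:%SZ"),
  as_data_frame = FALSE)
tab$schema  # BILL_DT: timestamp with no timezone attached; AMOUNT: double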
When the timestamps are converted into R, the timezonelessness is preserved:
> from_arrow <- read_csv_arrow(
+ "ta_sample.csv",
+ col_names = c("BILL_DT", "AMOUNT"),
+ col_types = "td",
+ skip = 1,
+ timestamp_parsers = c("%Y-%m-%dT%H:%M:%SZ"))
>
> attr(from_arrow$BILL_DT, "tzone")
NULL
R defaults to displaying timestamps without a tzone attribute in the local timezone (for me it's currently CDT; for you it looks like IST). Note that timestamps with an explicit timezone are displayed in that timezone.
> from_arrow$BILL_DT
[1] "2015-07-27 13:30:00 CDT" "2015-07-07 13:30:00 CDT"
[3] "2015-07-26 13:30:00 CDT" "2015-07-22 13:30:00 CDT"
[5] "2015-07-06 13:30:00 CDT"
If you would like to display the UTC timestamps, you can do a few things:
1. Explicitly set the tzone attribute (or use lubridate::with_tz() for the same operation):
> attr(from_arrow$BILL_DT, "tzone") <- "UTC"
> from_arrow$BILL_DT
[1] "2015-07-27 18:30:00 UTC" "2015-07-07 18:30:00 UTC"
[3] "2015-07-26 18:30:00 UTC" "2015-07-22 18:30:00 UTC"
[5] "2015-07-06 18:30:00 UTC"
2. Set the timezone in your R session so that when R goes to display the time it uses UTC. (Note: the tzone attribute is still unset here, but the display is UTC because the session timezone is set to UTC.)
> Sys.setenv(TZ="UTC")
> from_arrow <- read_csv_arrow(
+   "ta_sample.csv",
+   col_names = c("BILL_DT", "AMOUNT"),
+   col_types = "td",
+   skip = 1,
+   timestamp_parsers = c("%Y-%m-%dT%H:%M:%SZ"))
> from_arrow$BILL_DT
[1] "2015-07-27 18:30:00 UTC" "2015-07-07 18:30:00 UTC"
[3] "2015-07-26 18:30:00 UTC" "2015-07-22 18:30:00 UTC"
[5] "2015-07-06 18:30:00 UTC"
> attr(from_arrow$BILL_DT, "tzone")
NULL
3. Read the data into an Arrow table, and cast the timestamp to have an explicit timezone in Arrow before pulling the data into R with collect(). This csv -> Arrow table -> data.frame path is what happens under the hood anyway, so there are no additional conversions going on here (other than the cast). It can also be useful and more efficient to do operations on the Arrow table if you have other transformations to apply, though it is more code than the first two options.
> library(arrow)
> library(dplyr)
> tab <- read_csv_arrow(
+ "ta_sample.csv",
+ col_names = c("BILL_DT", "AMOUNT"),
+ col_types = "td",
+ skip = 1,
+ as_data_frame = FALSE)
>
> tab_df <- tab %>%
+ mutate(BILL_DT_cast = cast(BILL_DT, timestamp(unit = "s", timezone = "UTC"))) %>%
+ collect()
> attr(tab_df$BILL_DT, "tzone")
NULL
> attr(tab_df$BILL_DT_cast, "tzone")
[1] "UTC"
> tab_df
# A tibble: 5 × 3
BILL_DT AMOUNT BILL_DT_cast
<dttm> <dbl> <dttm>
1 2015-07-27 13:30:00 16000 2015-07-27 18:30:00
2 2015-07-07 13:30:00 6110 2015-07-07 18:30:00
3 2015-07-26 13:30:00 250 2015-07-26 18:30:00
4 2015-07-22 13:30:00 1000 2015-07-22 18:30:00
5 2015-07-06 13:30:00 2640000 2015-07-06 18:30:00
This is also made a bit more confusing because base R's strptime() doesn't parse timezone designators (which is why you're seeing the same clock time, but labelled IST, in your example above). lubridate's[2] parsing functions do respect the Z suffix, and you can see the difference here:
> lubridate::parse_date_time(c("2015-07-27T18:30:00Z", "2015-07-07T18:30:00Z"), "YmdHMS")
[1] "2015-07-27 18:30:00 UTC" "2015-07-07 18:30:00 UTC"
[1] Though we have two open issues related to adding this functionality: https://issues.apache.org/jira/browse/ARROW-12820 and https://issues.apache.org/jira/browse/ARROW-13348
[2] And, lubridate's docs even mention this:
ISO8601 signed offset in hours and minutes from UTC. For example -0800, -08:00 or -08, all represent 8 hours behind UTC. This format also matches the Z (Zulu) UTC indicator. Because base::strptime() doesn't fully support ISO8601 this format is implemented as an union of 4 orders: Ou (Z), Oz (-0800), OO (-08:00) and Oo (-08). You can use these four orders as any other but it is rarely necessary. parse_date_time2() and fast_strptime() support all of the timezone formats.
https://lubridate.tidyverse.org/reference/parse_date_time.html
I'm trying to create GPS schedules for satellite transmitters that are used to track the migration of a bird species I'm studying. The function below, called sched_gps_fixes, takes a vector of datetimes and writes them to a .ASF file, which is uploaded to the satellite transmitter; this tells the transmitter what date and time to take a GPS fix. Using R and the sched_gps_fixes function lets me quickly create a GPS schedule that starts on any day of the year. The software that comes with the transmitters does this as well, but there I would have to painstakingly select each time and date at which I want the transmitter to take a GPS location.
So I want to: 1) create a data frame that contains every day of the year in 2018 along with the time I want the transmitter to collect a GPS location, 2) use each row of the data frame as the start date for a sequence of datetimes (so starting on 2018-03-25 12:00:00, for example, I want a GPS schedule that takes a point every other day: 2018-03-25 12:00:00, 2018-03-27 12:00:00, etc.), and 3) create a .ASF file for each GPS schedule. Here's a simplified version of what I'm trying to accomplish:
library(lubridate)
# set the beginning time
start_date <- ymd_hms('2018-01-01 12:00:00')
# create a sequence of datetimes starting January 1
days_df <- seq(ymd_hms(start_date), ymd_hms(start_date+days(10)), by='1 days')
tz(days_df) <- "America/Chicago"
days_df <- as.data.frame(days_df)
days_df
# to reproduce the example
days_df <- structure(list(days_df = structure(c(1514829600, 1514916000,
1515002400, 1515088800, 1515175200, 1515261600, 1515348000, 1515434400,
1515520800, 1515607200, 1515693600), class = c("POSIXct", "POSIXt"
), tzone = "America/Chicago")), .Names = "days_df", row.names = c(NA,
-11L), class = "data.frame")
# the data frame looks like this:
days_df
1 2018-01-01 12:00:00
2 2018-01-02 12:00:00
3 2018-01-03 12:00:00
4 2018-01-04 12:00:00
5 2018-01-05 12:00:00
6 2018-01-06 12:00:00
7 2018-01-07 12:00:00
8 2018-01-08 12:00:00
9 2018-01-09 12:00:00
10 2018-01-10 12:00:00
11 2018-01-11 12:00:00
I would like to loop through each datetime in the data frame, and create a vector for each row of the data frame. So each vector would have a particular row's datetime as the starting date for a GPS schedule, which would take a point every 2 days (something like this):
[1] "2018-01-01 12:00:00 UTC" "2018-01-03 12:00:00 UTC" "2018-01-05 12:00:00 UTC" "2018-01-07 12:00:00 UTC"
[5] "2018-01-09 12:00:00 UTC" "2018-01-11 12:00:00 UTC"
Each vector (or GPS schedule) would then be run in the following function as 'gps_schedule' to create a .ASF file for the transmitters:
sched_gps_fixes(gps_schedule, tz = "America/Chicago", out_file = "./gps_fixes")
So I'm wondering how to create a for loop that would produce a vector of datetimes for each day of 2018. This is pseudocode for what I'm attempting to do:
# create a loop called 'create_schedules' to make the GPS schedules and produce a .ASF file for each day of 2018
create_schedules <- function(days_df) {
  for (row in 1:nrow(days_df)) {
    seq(days_df[row, 1], days_df[row, 1] + days(10), by = '2 days')
  }
}
# run the function
create_schedules(days_df)
I'm guessing I need an output to store and name each vector by its start date, among other things?
Thanks,
Jay
One option is to use mapply to generate a schedule for each row, based on the schedule definition provided by the OP:
library(lubridate)
# For the sample data, max_date needs to be calculated. To generate a
# schedule for the whole of 2018, max_date can instead be set to 31-Dec-2018.
max_date = max(days_df$days_df)
mapply(function(x) seq(x, max_date, by = "2 days"), days_df$days_df)
# Result: only the first few items of the generated list are shown; it continues
# [[1]]
# [1] "2018-01-01 12:00:00 CST" "2018-01-03 12:00:00 CST" "2018-01-05 12:00:00 CST"
# [4] "2018-01-07 12:00:00 CST" "2018-01-09 12:00:00 CST" "2018-01-11 12:00:00 CST"
#
# [[2]]
# [1] "2018-01-02 12:00:00 CST" "2018-01-04 12:00:00 CST" "2018-01-06 12:00:00 CST"
# [4] "2018-01-08 12:00:00 CST" "2018-01-10 12:00:00 CST"
#
# [[3]]
# [1] "2018-01-03 12:00:00 CST" "2018-01-05 12:00:00 CST" "2018-01-07 12:00:00 CST"
# [4] "2018-01-09 12:00:00 CST" "2018-01-11 12:00:00 CST"
# ....
# ....
# ....
# [[10]]
# [1] "2018-01-10 12:00:00 CST"
#
# [[11]]
# [1] "2018-01-11 12:00:00 CST"
If the OP prefers to have names for the items in the result list, mapply can be used as below.
Update: based on the OP's request, each schedule now runs from its start date to start + 10 days; 10 days is equivalent to 10*24*3600 seconds.
mapply(function(x, y) seq(y, y + 10*24*3600, by = "2 days"),
       as.character(days_df$days_df), days_df$days_df,
       SIMPLIFY = FALSE, USE.NAMES = TRUE)
# Result
# $`2018-01-01 12:00:00`
# [1] "2018-01-01 12:00:00 CST" "2018-01-03 12:00:00 CST" "2018-01-05 12:00:00 CST"
# [4] "2018-01-07 12:00:00 CST" "2018-01-09 12:00:00 CST" "2018-01-11 12:00:00 CST"
#.......
#.......
#.......so on
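For completeness, a minimal lapply() sketch (assuming lubridate is loaded and days_df as defined in the question) builds the same named list; days(10) is period arithmetic, so the endpoint stays at 12:00 noon even if the 10-day window crosses a DST transition, unlike adding 10*24*3600 seconds:
# One schedule per start date, named by its start datetime.
schedules <- lapply(days_df$days_df,
                    function(d) seq(d, d + days(10), by = "2 days"))
names(schedules) <- format(days_df$days_df, "%Y-%m-%d %H:%M:%S")
schedules[["2018-01-01 12:00:00"]]
# [1] "2018-01-01 12:00:00 CST" "2018-01-03 12:00:00 CST" "2018-01-05 12:00:00 CST"
# [4] "2018-01-07 12:00:00 CST" "2018-01-09 12:00:00 CST" "2018-01-11 12:00:00 CST"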
Consider the following example
library(lubridate)
library(tidyverse)
> hour(ymd_hms('2008-01-04 00:00:00'))
[1] 0
Now,
dataframe <- data_frame(time = c(ymd_hms('2008-01-04 00:00:00'),
ymd_hms('2008-01-04 00:01:00'),
ymd_hms('2008-01-04 00:02:00'),
ymd_hms('2008-01-04 00:03:00')),
value = c(1,2,3,4))
mutate(dataframe,hour = strftime(time, format="%H:%M:%S"),
hour2 = hour(time))
# A tibble: 4 × 4
time value hour hour2
<dttm> <dbl> <chr> <int>
1 2008-01-03 19:00:00 1 19:00:00 19
2 2008-01-03 19:01:00 2 19:01:00 19
3 2008-01-03 19:02:00 3 19:02:00 19
4 2008-01-03 19:03:00 4 19:03:00 19
What is going on here? Why are the dates converted into some local time that I don't even know?
This is not an issue with lubridate, but with the way POSIXct values are combined into a vector.
You have
> ymd_hms('2008-01-04 00:01:00')
[1] "2008-01-04 00:01:00 UTC"
But when combining into a vector you get
> c(ymd_hms('2008-01-04 00:01:00'), ymd_hms('2008-01-04 00:01:00'))
[1] "2008-01-03 19:01:00 EST" "2008-01-03 19:01:00 EST"
The reason is that the tzone attribute gets lost when combining POSIXct values (see c.POSIXct).
> attributes(ymd_hms('2008-01-04 00:01:00'))
$tzone
[1] "UTC"
$class
[1] "POSIXct" "POSIXt"
but
> attributes(c(ymd_hms('2008-01-04 00:01:00')))
$class
[1] "POSIXct" "POSIXt"
What you can use instead is
> ymd_hms(c('2008-01-04 00:01:00', '2008-01-04 00:01:00'))
[1] "2008-01-04 00:01:00 UTC" "2008-01-04 00:01:00 UTC"
which will use the default tz = "UTC" for all arguments.
You also need to pass tz = "UTC" into strftime because its default is your current time zone (unlike ymd_hms which defaults to UTC).
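Putting both fixes together, a minimal sketch using the data from the question: build the vector in a single ymd_hms() call so the tzone attribute survives, and pass tz = "UTC" to strftime() explicitly:
# One ymd_hms() call keeps the tzone attribute ("UTC") on the whole vector;
# strftime() is told explicitly which zone to format in.
dataframe <- data_frame(time = ymd_hms(c('2008-01-04 00:00:00',
                                         '2008-01-04 00:01:00',
                                         '2008-01-04 00:02:00',
                                         '2008-01-04 00:03:00')),
                        value = c(1, 2, 3, 4))
mutate(dataframe, hour = strftime(time, format = "%H:%M:%S", tz = "UTC"),
       hour2 = hour(time))
# Now both derived columns agree: "00:00:00"/0, "00:01:00"/0, and so on.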
I recently updated R to version 3.2.3, and now I have found a problem using seq with dates:
date1<-as.POSIXct("2014-01-30 02:00:00")
date2<-as.POSIXct("2014-12-24 11:00:00")
seq(date1,date2,by="month")
#[1] "2014-01-30 02:00:00 CET" "2014-03-02 02:00:00 CET"
#[3] NA "2014-04-30 02:00:00 CEST"
#[5] "2014-05-30 02:00:00 CEST" "2014-06-30 02:00:00 CEST"
#[7] "2014-07-30 02:00:00 CEST" "2014-08-30 02:00:00 CEST"
#[9] "2014-09-30 02:00:00 CEST" "2014-10-30 02:00:00 CET"
#[11] "2014-11-30 02:00:00 CET"
I don't understand where the NA comes from. I have tried this on different machines, with both the same R version as mine and a previous one, and in place of that NA they correctly give "2014-03-30". Furthermore, if I change the year in the dates from 2014 to 2015, no NAs are returned!
I guess that during the installation something in my locale was modified but I cannot understand how to fix the problem.
Sys.getlocale() returns:
"en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
and my system is a MacBook Pro running OS X Mavericks.
Thanks for any help!
I was guessing Germany, and here's what the situation was in the CET timezone on Mar 30 (a Sunday):
http://www.timetemperature.com/utc-converter/utc-20140330-germany-12.html
UTC or GMT Time Germany
Sunday 30th March 2014 00:00:00 Sun 01:00 AM
Sunday 30th March 2014 01:00:00 Sun 03:00 AM*
Changing the setting to Italy, I get the same result:
UTC or GMT Time Italy
Sunday 30th March 2014 00:00:00 Sun 01:00 AM
Sunday 30th March 2014 01:00:00 Sun 03:00 AM*
The key here is to be suspicious of weirdness when the time is in the early morning hours of a Spring or Fall date, or when calculations of intervals cross such dates. The rules change from year to year, and since countries often do the switch on a Sunday or Saturday morning, the exact dates jump around.
The changes vary by country, and in the US they may vary by state or even by "sub-state" boundaries: in Washington State in 2014 you find the change happening on the second Sunday of March:
http://www.timetemperature.com/utc-converter/utc-20140309-us-washington+state-12.html
UTC or GMT Time US-Washington State
snipped several times
Sunday 9th March 2014 07:00:00 Sat 11:00 PM
Sunday 9th March 2014 08:00:00 Sun 12:00 AM
Sunday 9th March 2014 09:00:00 Sun 01:00 AM
Sunday 9th March 2014 10:00:00 Sun 03:00 AM*
Sunday 9th March 2014 11:00:00 Sun 04:00 AM*
I'm in the same TZ as Washington State. With a Sys.timezone set, one can reproduce the NA, at least on a Mac. The implementation of times and timezones is OS-specific, so it's possible to see variations in how these oddities become visible:
> Sys.timezone(location = TRUE)
[1] "America/Los_Angeles"
> date1<-as.POSIXct("2014-01-09 02:00:00")
> date2<-as.POSIXct("2014-12-09 11:00:00")
> seq(date1,date2,by="month")
[1] "2014-01-09 02:00:00 PST" "2014-02-09 02:00:00 PST"
[3] NA "2014-04-09 02:00:00 PDT"
[5] "2014-05-09 02:00:00 PDT" "2014-06-09 02:00:00 PDT"
[7] "2014-07-09 02:00:00 PDT" "2014-08-09 02:00:00 PDT"
[9] "2014-09-09 02:00:00 PDT" "2014-10-09 02:00:00 PDT"
[11] "2014-11-09 02:00:00 PST" "2014-12-09 02:00:00 PST"
By inspecting the relevant code in seq.POSIXt, it appears that a call to seq with by="month" works as follows:
1. some manipulation of the input data
2. conversion of date1 & date2 to POSIXlt
3. creation of a sequence of month numbers spanning the interval from date1 to date2 (in this case 0, ..., 11)
4. manual update of date1$mon to this sequence of months (and up to this point the dates are all properly handled)
5. finally, conversion of the resulting dates back to POSIXct, and here the NA shows up
While the resulting NA is technically correct, since it is trying to convert an invalid date ("2014-03-30 02:00:00 CET", which does not exist because of the DST switch) to POSIXct, could the issue possibly be worked around by passing through difftimes? [*]
Not sure it is worth it, though...
[*] by difftimes I mean adding the correct number of seconds to the dates instead of just incrementing the month...
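As a practical aside, a sketch assuming only the calendar date and clock time matter (not the absolute instants): running the sequence in UTC sidesteps the problem, because UTC has no DST gap for the POSIXct conversion to fall into:
# UTC has no DST transitions, so every generated wall-clock time is valid.
date1_utc <- as.POSIXct("2014-01-30 02:00:00", tz = "UTC")
date2_utc <- as.POSIXct("2014-12-24 11:00:00", tz = "UTC")
seq(date1_utc, date2_utc, by = "month")
# The third element is now "2014-03-30 02:00:00 UTC" instead of NA.
# (The "2014-03-02" entry is the unrelated Jan-30 + 1 month normalization
# that also appears in the original output.)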
I am quite new to programming and to R.
My data set includes date-time values like the following:
2007/11/0103
2007/11/0104
2007/11/0105
2007/11/0106
I need an operation that takes the first 10 characters, inserts a space, appends the last two characters, and then adds ":00", for every row.
Expected results:
2007/11/01 03:00
2007/11/01 04:00
2007/11/01 05:00
2007/11/01 06:00
If you want to actually turn your data into an object of class "POSIXlt" "POSIXt" in R (so you can add or subtract days, minutes, etc.), you could do
# Your data
temp <- c("2007/11/0103", "2007/11/0104", "2007/11/0105", "2007/11/0106")
temp2 <- strptime(temp, "%Y/%m/%d%H")
## [1] "2007-11-01 03:00:00 IST" "2007-11-01 04:00:00 IST" "2007-11-01 05:00:00 IST" "2007-11-01 06:00:00 IST"
You could then extract hours for example
temp2$hour
## [1] 3 4 5 6
Add hours
temp2 + 3600
## [1] "2007-11-01 04:00:00 IST" "2007-11-01 05:00:00 IST" "2007-11-01 06:00:00 IST" "2007-11-01 07:00:00 IST"
And so on. If you just want the format you mentioned in your question (which is just a character string), you can also do
format(strptime(temp, "%Y/%m/%d%H"), format = "%Y/%m/%d %H:%M")
#[1] "2007/11/01 03:00" "2007/11/01 04:00" "2007/11/01 05:00" "2007/11/01 06:00"
Try
library(lubridate)
dat <- read.table(text="2007/11/0103
2007/11/0104
2007/11/0105
2007/11/0106",header=F,stringsAsFactors=F)
dat$V1 <- format(ymd_h(dat$V1),"%Y/%m/%d %H:%M")
dat
# V1
# 1 2007/11/01 03:00
# 2 2007/11/01 04:00
# 3 2007/11/01 05:00
# 4 2007/11/01 06:00
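If the parsed values are needed for further date-time arithmetic, it may be worth keeping the ymd_h() result (a POSIXct vector) and formatting back to character only for display; a small sketch, using dat as freshly read by read.table() above (before V1 is overwritten):
dt <- ymd_h(dat$V1)
dt + hours(1)                  # e.g. shift every timestamp by one hour
format(dt, "%Y/%m/%d %H:%M")   # the question's display format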
Suppose your dates are in a character vector named dates:
library(stringr)
paste0(paste(str_sub(dates, end=10), str_sub(dates, 11)), ":00")
paste and substr are your friends here. Type ? before either name to see its documentation.
my.parser <- function(a){
  # paste0 is like paste but does not insert a separator between the pieces
  paste0(substr(a, 1, 10), ' ', substr(a, 11, 12), ':00')
}
a <- '2007/11/0103'
my.parser(a)  # = "2007/11/01 03:00"
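Since substr() and paste0() are vectorized, the same function also handles the whole vector from the question in one call:
my.parser(c("2007/11/0103", "2007/11/0104", "2007/11/0105", "2007/11/0106"))
# [1] "2007/11/01 03:00" "2007/11/01 04:00" "2007/11/01 05:00" "2007/11/01 06:00"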