Associate numbers to datetime/timestamp - r

I have a dataframe df with a certain number of columns. One of them, ts, contains timestamps:
1462147403122 1462147412990 1462147388224 1462147415651 1462147397069 1462147392497
...
1463529545634 1463529558639 1463529556798 1463529558788 1463529564627 1463529557370
I have also at my disposal the corresponding datetime in the datetime column:
"2016-05-02 02:03:23 CEST" "2016-05-02 02:03:32 CEST" "2016-05-02 02:03:08 CEST" "2016-05-02 02:03:35 CEST" "2016-05-02 02:03:17 CEST" "2016-05-02 02:03:12 CEST"
...
"2016-05-18 01:59:05 CEST" "2016-05-18 01:59:18 CEST" "2016-05-18 01:59:16 CEST" "2016-05-18 01:59:18 CEST" "2016-05-18 01:59:24 CEST" "2016-05-18 01:59:17 CEST"
As you can see, my dataframe contains data across several days. Let's say there are 3. I would like to add a column containing the number 1, 2, or 3: 1 if the row belongs to the first day, 2 for the second day, etc.
Thank you very much in advance,
Clement

One way to do this is to keep a running count of days elapsed, incrementing each time the date changes, as demonstrated below.
# Fake data
library(lubridate)  # for tz()
dat = data.frame(datetime = c(seq(as.POSIXct("2016-05-02 01:03:11"),
                                  as.POSIXct("2016-05-05 01:03:11"), length.out=6),
                              seq(as.POSIXct("2016-05-09 01:09:11"),
                                  as.POSIXct("2016-05-16 02:03:11"), length.out=4)))
tz(dat$datetime) = "UTC"
Note: if your datetime column is not already in a datetime format, convert it to one using as.POSIXct.
Now, create a new column with the day number, counting the first day in the sequence as day 1.
dat$day = c(1, cumsum(as.numeric(diff(as.Date(dat$datetime, tz="UTC")))) + 1)
dat
datetime day
1 2016-05-02 01:03:11 1
2 2016-05-02 15:27:11 1
3 2016-05-03 05:51:11 2
4 2016-05-03 20:15:11 2
5 2016-05-04 10:39:11 3
6 2016-05-05 01:03:11 4
7 2016-05-09 01:09:11 8
8 2016-05-11 09:27:11 10
9 2016-05-13 17:45:11 12
10 2016-05-16 02:03:11 15
I specified the timezone in the code above to avoid getting tripped up by potential silent shifts between my local timezone and UTC. For example, note the silent shift from my default local time zone ("America/Los_Angeles") to UTC when converting a POSIXct datetime to a date:
# Fake data
datetime = seq(as.POSIXct("2016-05-02 01:03:11"), as.POSIXct("2016-05-05 01:03:11"), length.out=6)
tz(datetime)
[1] ""
date = as.Date(datetime)
tz(date)
[1] "UTC"
data.frame(datetime, date)
datetime date
1 2016-05-02 01:03:11 2016-05-02
2 2016-05-02 15:27:11 2016-05-02
3 2016-05-03 05:51:11 2016-05-03
4 2016-05-03 20:15:11 2016-05-04 # Note day is different due to timezone shift
5 2016-05-04 10:39:11 2016-05-04
6 2016-05-05 01:03:11 2016-05-05
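A side note: the question asks for consecutive day numbers (1, 2, 3, ...) per distinct day, whereas the code above counts calendar days elapsed. If consecutive numbering is what you want, a minimal sketch using the same dat (convert to dates in an explicit timezone, then number the distinct dates in order):
dat$day = as.integer(factor(as.Date(dat$datetime, tz="UTC")))
With the fake data above this gives 1, 1, 2, 2, 3, 4, 5, 6, 7, 8 instead of 1, 1, 2, 2, 3, 4, 8, 10, 12, 15.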

Related

Issue with timestamp parsing in read_csv_arrow from arrow package in R

Suppose there is a csv file named ta_sample.csv as under:
"BILL_DT","AMOUNT"
"2015-07-27T18:30:00Z",16000
"2015-07-07T18:30:00Z",6110
"2015-07-26T18:30:00Z",250
"2015-07-22T18:30:00Z",1000
"2015-07-06T18:30:00Z",2640000
Reading the above with read_csv_arrow and customizing the column types (which is almost always needed with actual production data):
library(arrow)
read_csv_arrow(
  "ta_sample.csv",
  col_names = c("BILL_DT", "AMOUNT"),
  col_types = "td",
  skip = 1,
  timestamp_parsers = c("%Y-%m-%dT%H:%M:%SZ"))
the result is as under:
# A tibble: 5 x 2
BILL_DT AMOUNT
<dttm> <dbl>
1 2015-07-28 00:00:00 16000
2 2015-07-08 00:00:00 6110
3 2015-07-27 00:00:00 250
4 2015-07-23 00:00:00 1000
5 2015-07-07 00:00:00 2640000
The issue here is that the dates are shifted forward by one day and the time disappears. It is worth mentioning that data.table::fread() as well as readr::read_csv() read it properly, e.g.,
library(readr)
read_csv("ta_sample.csv")
# A tibble: 5 x 2
BILL_DT AMOUNT
<dttm> <dbl>
1 2015-07-27 18:30:00 16000
2 2015-07-07 18:30:00 6110
3 2015-07-26 18:30:00 250
4 2015-07-22 18:30:00 1000
5 2015-07-06 18:30:00 2640000
Parsing example values from the BILL_DT column with strptime also works perfectly, as shown below:
strptime(c("2015-07-27T18:30:00Z", "2015-07-07T18:30:00Z"), "%Y-%m-%dT%H:%M:%SZ")
[1] "2015-07-27 18:30:00 IST" "2015-07-07 18:30:00 IST"
What parameters in read_csv_arrow need to be adjusted to get results identical to those given by readr::read_csv()?
There are a few things going on here, but they all relate to timezones + how they are interpreted by various parts of R + Arrow + other packages.
When Arrow reads in timestamps, it treats the values as if they were UTC. Arrow does not yet have the ability to specify alternative timezones when parsing[1], so it stores these values as timezoneless (and assumes UTC). In this case, since the timestamps you have are UTC (per ISO 8601, the Z at the end means UTC), they are stored correctly in Arrow as timezoneless UTC timestamps. The values of the timestamps are the same (that is, they represent the same instant in UTC); the difference is in how they are displayed: as the time in UTC, or in the local timezone.
When the timestamps are converted into R, the timezonelessness is preserved:
> from_arrow <- read_csv_arrow(
+ "ta_sample.csv",
+ col_names = c("BILL_DT", "AMOUNT"),
+ col_types = "td",
+ skip = 1,
+ timestamp_parsers = c("%Y-%m-%dT%H:%M:%SZ"))
>
> attr(from_arrow$BILL_DT, "tzone")
NULL
R defaults to displaying timestamps without a tzone attribute in the local timezone (for me it's currently CDT, for you it looks like it's IST). And, note that timestamps with an explicit timezone are displayed in that timezone.
> from_arrow$BILL_DT
[1] "2015-07-27 13:30:00 CDT" "2015-07-07 13:30:00 CDT"
[3] "2015-07-26 13:30:00 CDT" "2015-07-22 13:30:00 CDT"
[5] "2015-07-06 13:30:00 CDT"
If you would like to display the UTC timestamps, you can do a few things:
Explicitly set the tzone attribute (or you could use lubridate::with_tz() for the same operation):
> attr(from_arrow$BILL_DT, "tzone") <- "UTC"
> from_arrow$BILL_DT
[1] "2015-07-27 18:30:00 UTC" "2015-07-07 18:30:00 UTC"
[3] "2015-07-26 18:30:00 UTC" "2015-07-22 18:30:00 UTC"
[5] "2015-07-06 18:30:00 UTC"
You can set the timezone in your R session so that when R displays the time it uses UTC (note: the tzone attribute is still unset here, but the display is UTC because the session timezone is set to UTC):
> Sys.setenv(TZ="UTC")
> from_arrow <- read_csv_arrow(
+   "ta_sample.csv",
+   col_names = c("BILL_DT", "AMOUNT"),
+   col_types = "td",
+   skip = 1,
+   timestamp_parsers = c("%Y-%m-%dT%H:%M:%SZ"))
> from_arrow$BILL_DT
[1] "2015-07-27 18:30:00 UTC" "2015-07-07 18:30:00 UTC"
[3] "2015-07-26 18:30:00 UTC" "2015-07-22 18:30:00 UTC"
[5] "2015-07-06 18:30:00 UTC"
> attr(from_arrow$BILL_DT, "tzone")
NULL
You could read the data into an Arrow table and cast the timestamp to an explicit timezone in Arrow before pulling the data into R with collect(). The csv -> Arrow table -> data.frame path is what happens under the hood anyway, so no additional conversions are going on here (other than the cast). Operating on the Arrow table can also be useful and more efficient if you have other transformations to apply, though it is more code than the first two options.
> library(arrow)
> library(dplyr)
> tab <- read_csv_arrow(
+ "ta_sample.csv",
+ col_names = c("BILL_DT", "AMOUNT"),
+ col_types = "td",
+ skip = 1,
+ as_data_frame = FALSE)
>
> tab_df <- tab %>%
+ mutate(BILL_DT_cast = cast(BILL_DT, timestamp(unit = "s", timezone = "UTC"))) %>%
+ collect()
> attr(tab_df$BILL_DT, "tzone")
NULL
> attr(tab_df$BILL_DT_cast, "tzone")
[1] "UTC"
> tab_df
# A tibble: 5 × 3
BILL_DT AMOUNT BILL_DT_cast
<dttm> <dbl> <dttm>
1 2015-07-27 13:30:00 16000 2015-07-27 18:30:00
2 2015-07-07 13:30:00 6110 2015-07-07 18:30:00
3 2015-07-26 13:30:00 250 2015-07-26 18:30:00
4 2015-07-22 13:30:00 1000 2015-07-22 18:30:00
5 2015-07-06 13:30:00 2640000 2015-07-06 18:30:00
This is also made a bit more confusing because base R's strptime() doesn't parse timezones (which is why you're seeing the same clock time, but labelled IST, in your example above). lubridate's[2] parsing functions do respect the Z suffix, and you can see the difference here:
> lubridate::parse_date_time(c("2015-07-27T18:30:00Z", "2015-07-07T18:30:00Z"), "YmdHMS")
[1] "2015-07-27 18:30:00 UTC" "2015-07-07 18:30:00 UTC"
[1] Though we have two issues related to adding this functionality https://issues.apache.org/jira/browse/ARROW-12820 and https://issues.apache.org/jira/browse/ARROW-13348
[2] And, lubridate's docs even mention this:
ISO8601 signed offset in hours and minutes from UTC. For example -0800, -08:00 or -08, all represent 8 hours behind UTC. This format also matches the Z (Zulu) UTC indicator. Because base::strptime() doesn't fully support ISO8601 this format is implemented as an union of 4 orders: Ou (Z), Oz (-0800), OO (-08:00) and Oo (-08). You can use these four orders as any other but it is rarely necessary. parse_date_time2() and fast_strptime() support all of the timezone formats.
https://lubridate.tidyverse.org/reference/parse_date_time.html

Formatting a date in R without zeros in the year

I'm having trouble using the function as.POSIXct. When I apply it to my dataset, the year appears with two leading zeros, like this:
datu1$timestamp <- as.POSIXct(datu1$date.sec, origin = "1970-01-01", tz="GMT")
datu1$timestamp <- as.POSIXct(datu1$timestamp,
                              format = "%Y-%m-%d %H:%M:%S", tz = 'GMT')
head(datu1)
ID date.sec lon lat lon.025 lat.025 lon.5 lat.5 lon.975 lat.975
1 102211.10 -61827840000 -38.6616 -13.59272 -40.5025 -15.25025 -38.7 -13.76 -36.9000 -10.88950
2 102211.10 -61827818400 -38.6647 -13.60312 -40.4000 -15.17025 -38.7 -13.77 -37.0975 -11.03975
3 102211.10 -61827796800 -38.6723 -13.64505 -40.3000 -15.10000 -38.7 -13.79 -37.0000 -11.29950
4 102211.10 -61827775200 -38.6837 -13.68972 -40.2000 -14.98025 -38.7 -13.83 -37.2000 -11.45975
5 102211.10 -61827753600 -38.7030 -13.73054 -40.2000 -14.98100 -38.7 -13.84 -37.3000 -11.62925
6 102211.10 -61827732000 -38.7221 -13.77846 -40.0000 -15.04050 -38.7 -13.88 -37.5000 -11.69950
bmode bmode.5 timestamp
1 1.556 2 0010-10-03 00:00:00
2 1.565 2 0010-10-03 06:00:00
3 1.571 2 0010-10-03 12:00:00
4 1.571 2 0010-10-03 18:00:00
5 1.589 2 0010-10-04 00:00:00
6 1.599 2 0010-10-04 06:00:00
How can I fix this to get the full year (like: 2010) instead of two zeros?
Perhaps your data was encoded with an unusual origin (e.g. Excel uses "1899-12-30"). Just adapt origin= until the date matches what you require.
as.POSIXct(-61827840000, origin="1970-01-01", tz="GMT")
# [1] "0010-10-03 GMT"
as.POSIXct(-61827840000, origin="3970-01-01", tz="GMT")
# [1] "2010-10-03 GMT"

Using logicals to change date format in R

New R user here - I have many .csv files containing time stamps (date_time) in one column and temperature readings in the other. I am trying to write a function that detects the date_time format, and then changes it to a different format. Because of the way the data was collected, the date/time format is different in some of the .csv files. I want the function to change the date_time for all files to the same format.
Date_time format I want: %m/%d/%y %H:%M:%S
Date_time format I want changed to above: "%y-%m-%d %H:%M:%S"
> head(file1data)
x date_time temperature coupler_d coupler_a host_con stopped EoF
1 1 18-07-10 09:00:00 41.137 Logged
2 2 18-07-10 09:15:00 41.322
3 3 18-07-10 09:30:00 41.554
4 4 18-07-10 09:45:00 41.832
5 5 18-07-10 10:00:00 42.156
6 6 18-07-10 10:15:00 42.755
> head(file2data)
x date_time temperature coupler_d coupler_a host_con stopped EoF
1 1 07/10/18 01:00:00 PM 8.070 Logged
2 2 07/10/18 01:15:00 PM 8.095
3 3 07/10/18 01:30:00 PM 8.120
4 4 07/10/18 01:45:00 PM 8.120
5 5 07/10/18 02:00:00 PM 8.020
6 6 07/10/18 02:15:00 PM 7.795
file2data is in the correct format. file1data is incorrect.
I have tried using logicals to detect and replace the date format e.g.,
file1data %>%
if(str_match_all(date_time,"([0-9][0-9]{2})[-.])")){
format(as.POSIXct(date_time,format="%y-%m-%d %H:%M:%S"),"%m/%d/%y %H:%M:%S")
}else{format(date_time,"%m/%d/%y %H:%M:%S")}
but this has not worked, I get the following errors:
Error in if (.) str_match_all(date_time, "([0-9][0-9]{2})[-.])") else { :
argument is not interpretable as logical
In addition: Warning message:
In if (.) str_match_all(date_time, "([0-9][0-9]{2})[-.])") else { :
the condition has length > 1 and only the first element will be used
Any ideas?
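Two things go wrong here: the pipe passes file1data itself in as the condition of if() (hence "argument is not interpretable as logical"), and if() needs a single TRUE/FALSE while str_match_all() returns one result per row (hence "the condition has length > 1"). A vectorized alternative avoids both. A minimal sketch, assuming the column layout shown above and that dplyr and stringr are loaded:
library(dplyr)
library(stringr)

file1data <- file1data %>%
  mutate(date_time = if_else(
    str_detect(date_time, "^\\d{2}-\\d{2}-\\d{2}"),   # rows in "%y-%m-%d ..." form
    format(as.POSIXct(date_time, format = "%y-%m-%d %H:%M:%S"),
           "%m/%d/%y %H:%M:%S"),
    as.character(date_time)                            # already in the target form
  ))
Because str_detect() returns one logical per row, if_else() converts only the rows that match the dashed format and leaves the rest untouched.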

Why does dplyr convert POSIXct objects

I have a date-time object of class POSIXct. I need to adjust the values by adding several hours. I understand that I can do this using basic addition. For example, I can add 5 hours to a POSIXct object like so:
x <- as.POSIXct("2009-08-02 18:00:00", format="%Y-%m-%d %H:%M:%S")
x
[1] "2009-08-02 18:00:00 PDT"
x + (5*60*60)
[1] "2009-08-02 23:00:00 PDT"
Now I have a data frame in which some times are ok and some are bad.
> df
set_time duration up_time
1 2009-05-31 14:10:00 3 2009-05-31 11:10:00
2 2009-08-02 18:00:00 4 2009-08-02 23:00:00
3 2009-08-03 01:20:00 5 2009-08-03 06:20:00
4 2009-08-03 06:30:00 2 2009-08-03 11:30:00
Note that the first data frame entry has an 'up_time' less than the 'set_time'. So in this context a 'good' time is one where the set_time < up_time. And a 'bad' time is one in which set_time > up_time. I want to leave the good entries alone and fix the bad entries. The bad entries should be fixed by creating an 'up_time' that is equal to the 'set_time' + duration. I do this with the following dplyr pipe:
df1 <- tbl_df(df) %>% mutate(up_time = ifelse(set_time > up_time, set_time +
(duration*60*60), up_time))
df1
# A tibble: 4 x 3
set_time duration up_time
<dttm> <dbl> <dbl>
1 2009-05-31 14:10:00 3. 1243815000.
2 2009-08-02 18:00:00 4. 1249279200.
3 2009-08-03 01:20:00 5. 1249305600.
4 2009-08-03 06:30:00 2. 1249324200.
Up time has been coerced to numeric:
> str(df1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4 obs. of 3 variables:
$ set_time: POSIXct, format: "2009-05-31 14:10:00" "2009-08-02 18:00:00"
"2009-08-03 01:20:00" "2009-08-03 06:30:00"
$ duration: num 3 4 5 2
$ up_time : num 1.24e+09 1.25e+09 1.25e+09 1.25e+09
I can convert it back to the desired POSIXct format using:
> as.POSIXct(df1$up_time,origin="1970-01-01")
[1] "2009-05-31 17:10:00 PDT" "2009-08-02 23:00:00 PDT" "2009-08-03 06:20:00
PDT" "2009-08-03 11:30:00 PDT"
But I feel like this last step shouldn't be necessary. Can I/How can I avoid having dplyr change my variable formatting?
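Base ifelse() strips attributes, including the POSIXct class, from its result, which is why up_time comes back numeric. dplyr::if_else() is stricter about types and preserves the class, so one fix, assuming both columns are POSIXct as in the question, is the following sketch:
library(dplyr)

df1 <- df %>%
  mutate(up_time = if_else(set_time > up_time,
                           set_time + duration * 60 * 60,
                           up_time))
With this, str(df1) shows up_time still as POSIXct and no back-conversion is needed.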

R: Find missing timestamps in csv

As I failed to solve my problem with PHP/MySQL or Excel due to the data size, I'm taking my very first steps with R now and struggling a bit. The problem is this: I have a second-by-second CSV file with half a year of data that looks like this:
metering,timestamp
123,2016-01-01 00:00:00
345,2016-01-01 00:00:01
243,2016-01-01 00:00:02
101,2016-01-01 00:00:04
134,2016-01-01 00:00:06
As you see, there are some seconds missing every once in a while (don't ask me why the values are written before the timestamp, but that's how I received the data…). Now I'm trying to calculate the number of values (= seconds) that are missing.
So my idea was
to create a vector that is correct (includes all sec-by-sec timestamps),
match the given CSV file with that new vector, and
sum up all the timestamps with no value.
I managed to make step 1 happen with the following code:
RegularTimeSeries <- seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"), as.POSIXct("2016-01-01 00:00:30", tz = "UTC"), by = "1 sec")
write.csv(RegularTimeSeries, file = "RegularTimeSeries.csv")
To have an idea what I did I also exported the vector to a CSV that looks like this:
"1",2016-01-01 00:00:00
"2",2016-01-01 00:00:01
"3",2016-01-01 00:00:02
"4",2016-01-01 00:00:03
"5",2016-01-01 00:00:04
"6",2016-01-01 00:00:05
"7",2016-01-01 00:00:06
Unfortunately I have no idea how to go on with step 2 and 3. I found some very similar examples (http://www.r-bloggers.com/fix-missing-dates-with-r/, R: Insert rows for missing dates/times), but as a total R noob I struggled to translate these examples to my given sec-by-sec data.
Some hints for the greenhorn would be very very helpful – thank you very much in advance :)
In the tidyverse,
library(dplyr)
library(tidyr)

df %>%
  # parse datetimes
  mutate(timestamp = as.POSIXct(timestamp)) %>%
  # complete to the full sequence from min to max, by second
  complete(timestamp = seq.POSIXt(min(timestamp), max(timestamp), by = 'sec'))
## # A tibble: 7 x 2
## timestamp metering
## <time> <int>
## 1 2016-01-01 00:00:00 123
## 2 2016-01-01 00:00:01 345
## 3 2016-01-01 00:00:02 243
## 4 2016-01-01 00:00:03 NA
## 5 2016-01-01 00:00:04 101
## 6 2016-01-01 00:00:05 NA
## 7 2016-01-01 00:00:06 134
If you want the number of NAs (i.e. the number of seconds with no data), add on
%>% tally(is.na(metering))
## # A tibble: 1 x 1
## n
## <int>
## 1 2
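Equivalently, if you keep the completed result in a variable (df_full here is just an illustrative name), a plain base R count works too:
df_full <- df %>%
  mutate(timestamp = as.POSIXct(timestamp)) %>%
  complete(timestamp = seq.POSIXt(min(timestamp), max(timestamp), by = 'sec'))
sum(is.na(df_full$metering))
## [1] 2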
You can check which values of your RegularTimeSeries are in your broken time series using which and %in%. First create BrokenTimeSeries from your example:
RegularTimeSeries <- seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"), as.POSIXct("2016-01-01 00:00:30", tz = "UTC"), by = "1 sec")
BrokenTimeSeries <- RegularTimeSeries[-c(3,6,9)] # remove some seconds
This will give you the indices of values within RegularTimeSeries that are not in BrokenTimeSeries:
> which(!(RegularTimeSeries %in% BrokenTimeSeries))
[1] 3 6 9
This will return the actual values:
> RegularTimeSeries[which(!(RegularTimeSeries %in% BrokenTimeSeries))]
[1] "2016-01-01 00:00:02 UTC" "2016-01-01 00:00:05 UTC" "2016-01-01 00:00:08 UTC"
Maybe I'm misunderstanding your problem, but you can count the number of missing seconds simply by subtracting the length of your broken time series from that of RegularTimeSeries, or by taking the length of either of the two resulting vectors above.
> length(RegularTimeSeries) - length(BrokenTimeSeries)
[1] 3
> length(which(!(RegularTimeSeries %in% BrokenTimeSeries)))
[1] 3
> length(RegularTimeSeries[which(!(RegularTimeSeries %in% BrokenTimeSeries))])
[1] 3
If you want to merge the files together to see the missing values you can do something like this:
#data with regular time series and a "step"
df <- data.frame(
RegularTimeSeries
)
df$BrokenTimeSeries[RegularTimeSeries %in% BrokenTimeSeries] <- df$RegularTimeSeries
df$BrokenTimeSeries <- as.POSIXct(df$BrokenTimeSeries, origin="2015-01-01", tz="UTC")
resulting in:
> df[1:12,]
   RegularTimeSeries    BrokenTimeSeries
1  2016-01-01 00:00:00 2016-01-01 00:00:00
2  2016-01-01 00:00:01 2016-01-01 00:00:01
3  2016-01-01 00:00:02                <NA>
4  2016-01-01 00:00:03 2016-01-01 00:00:03
5  2016-01-01 00:00:04 2016-01-01 00:00:04
6  2016-01-01 00:00:05                <NA>
7  2016-01-01 00:00:06 2016-01-01 00:00:06
8  2016-01-01 00:00:07 2016-01-01 00:00:07
9  2016-01-01 00:00:08                <NA>
10 2016-01-01 00:00:09 2016-01-01 00:00:09
11 2016-01-01 00:00:10 2016-01-01 00:00:10
12 2016-01-01 00:00:11 2016-01-01 00:00:11
If all you want is the number of missing seconds, it can be done much more simply: find the number of seconds in your time range, then subtract the number of rows in your dataset. In R this could be done along these lines:
n.seconds <- difftime("2016-06-01 00:00:00", "2016-01-01 00:00:00", units="secs")
n.rows <- nrow(my.data.frame)
n.missing.values <- n.seconds - n.rows
Adjust the time range and the data frame name to match your data.
Hope it helps.
d <- (c("2016-01-01 00:00:01",
"2016-01-01 00:00:02",
"2016-01-01 00:00:03",
"2016-01-01 00:00:04",
"2016-01-01 00:00:05",
"2016-01-01 00:00:06",
"2016-01-01 00:00:10",
"2016-01-01 00:00:12",
"2016-01-01 00:00:14",
"2016-01-01 00:00:16",
"2016-01-01 00:00:18",
"2016-01-01 00:00:20",
"2016-01-01 00:00:22"))
d <- as.POSIXct(d)
for (i in 2:length(d)){
if(difftime(d[i-1],d[i], units = "secs") < -1 ){
c[i] <- d[i]
}
}
class(c) <- c('POSIXt','POSIXct')
c
 [1] NA                        NA                        NA                        NA                        NA
 [6] NA                        "2016-01-01 00:00:10 EST" "2016-01-01 00:00:12 EST" "2016-01-01 00:00:14 EST" "2016-01-01 00:00:16 EST"
[11] "2016-01-01 00:00:18 EST" "2016-01-01 00:00:20 EST" "2016-01-01 00:00:22 EST"
