Consider the following example
library(lubridate)
library(tidyverse)
> hour(ymd_hms('2008-01-04 00:00:00'))
[1] 0
Now,
dataframe <- data_frame(time = c(ymd_hms('2008-01-04 00:00:00'),
ymd_hms('2008-01-04 00:01:00'),
ymd_hms('2008-01-04 00:02:00'),
ymd_hms('2008-01-04 00:03:00')),
value = c(1,2,3,4))
mutate(dataframe,hour = strftime(time, format="%H:%M:%S"),
hour2 = hour(time))
# A tibble: 4 × 4
time value hour hour2
<dttm> <dbl> <chr> <int>
1 2008-01-03 19:00:00 1 19:00:00 19
2 2008-01-03 19:01:00 2 19:01:00 19
3 2008-01-03 19:02:00 3 19:02:00 19
4 2008-01-03 19:03:00 4 19:03:00 19
What is going on here? Why are the dates converted into some local time which I dont event know?
This is not an issue with lubridate, but with the way POSIXct values are combined into a vector.
You have
> ymd_hms('2008-01-04 00:01:00')
[1] "2008-01-04 00:01:00 UTC"
But when combining into a vector you get
> c(ymd_hms('2008-01-04 00:01:00'), ymd_hms('2008-01-04 00:01:00'))
[1] "2008-01-03 19:01:00 EST" "2008-01-03 19:01:00 EST"
The reason is that the tzone attribute gets lost when combining POSIXct values (see c.POSIXct).
> attributes(ymd_hms('2008-01-04 00:01:00'))
$tzone
[1] "UTC"
$class
[1] "POSIXct" "POSIXt"
but
> attributes(c(ymd_hms('2008-01-04 00:01:00')))
$class
[1] "POSIXct" "POSIXt"
What you can use instead is
> ymd_hms(c('2008-01-04 00:01:00', '2008-01-04 00:01:00'))
[1] "2008-01-04 00:01:00 UTC" "2008-01-04 00:01:00 UTC"
which will use the default tz = "UTC" for all arguments.
You also need to pass tz = "UTC" into strftime because its default is your current time zone (unlike ymd_hms which defaults to UTC).
Related
I'm new to R - and searched old post for an answer but failed to come across anything that resolved my issue.
I pulled in a csv with the time a trip started in the mdy h:mm:ss format, but it is currently recognized as a character. I've tried to use mdy_hms(c("11/1/2020 0:05:00","11/1/2020 7:29:00","11/1/2020 14:04:00"))
as well as
as.Date(parse_date_time(dc_biketrips$started_at, c(mdy_hms))) to no avail.
Does anyone have any suggestions for how I could fix this?
UPDATE: I also tried to use date <-mdy_hms(c("11/1/2020 0:05:00","11/1/2020 7:29:00","11/1/2020 14:04:00")) str(date)but this also did not work
attempt to use date <-mdy_hms(C("11/1/2020 0:05:00"etc
image of csv
The first of your two options works:
library(lubridate)
date <-mdy_hms(c("11/1/2020 0:05:00","11/1/2020 7:29:00","11/1/2020 14:04:00"))
str(date)
# POSIXct[1:3], format: "2020-11-01 00:05:00" "2020-11-01 07:29:00" "2020-11-01 14:04:00"
How were your data "pulled in"?
One option would be to use as.POSIXct:
started_at <- c("11/1/2020 0:05:00","11/1/2020 7:29:00","11/1/2020 14:04:00")
as.POSIXct(started_at, format = "%m/%d/%Y %H:%M:%OS")
#> [1] "2020-11-01 00:05:00 CET" "2020-11-01 07:29:00 CET"
#> [3] "2020-11-01 14:04:00 CET"
EDIT
library(lubridate)
library(dplyr)
started_at <- c("11/1/2020 0:05:00","11/1/2020 7:29:00","11/1/2020 14:04:00")
Both as.POSIXct and lubridate::mdy_hms return an object of class "POSIXct" "POSIXt"
class(as.POSIXct(started_at, format = "%m/%d/%Y %H:%M:%OS"))
#> [1] "POSIXct" "POSIXt"
class(mdy_hms(started_at))
#> [1] "POSIXct" "POSIXt"
Not sure what you expect. When I run your code everything works fine except that we end up with 0 obs after filtering for week < 15 as all the dates in the example data are from week 44:
dc_biketrips <- data.frame(
started_at
)
dc_biketrips <- dc_biketrips %>%
mutate(started_at = as.POSIXct(started_at, format = "%m/%d/%Y %H:%M:%OS"),
interval60 = floor_date(started_at, unit = "hour"),
interval15 = floor_date(started_at, unit = "15 mins"),
week = week(interval60),
dotw = wday(interval60, label=TRUE))
dc_biketrips
#> started_at interval60 interval15 week dotw
#> 1 2020-11-01 00:05:00 2020-11-01 00:00:00 2020-11-01 00:00:00 44 So
#> 2 2020-11-01 07:29:00 2020-11-01 07:00:00 2020-11-01 07:15:00 44 So
#> 3 2020-11-01 14:04:00 2020-11-01 14:00:00 2020-11-01 14:00:00 44 So
dc_biketrips %>%
filter(week < 15)
#> [1] started_at interval60 interval15 week dotw
#> <0 rows> (or 0-length row.names)
I am trying to figure out how to add time intervals to a dates vector in R using units of less than 1 second. As you can see, padding one second intervals works, however decimals fail. I have looked around and it seems there is no "milliseconds" unit in R. Is there some way I can pad a vector of times with intervals of less than 1 second?
> require(padr)
> require(dplyr)
> options(digits.secs=3)
> start = as.POSIXct("2020-01-28 03:31:22.209 EST", format = "%Y-%m-%d %H:%M:%OS")
> end = as.POSIXct("2020-01-28 05:31:22.209 EST", format = "%Y-%m-%d %H:%M:%OS")
>
> minz <- seq(start, end, units = "minutes", by = "1 min")
> head(minz)
[1] "2020-01-28 03:31:22.209 EST" "2020-01-28 03:32:22.209 EST" "2020-01-28 03:33:22.209 EST" "2020-01-28 03:34:22.209 EST"
[5] "2020-01-28 03:35:22.209 EST" "2020-01-28 03:36:22.209 EST"
> secz <- as.data.frame(minz) %>% pad('1 sec')
> head(secz)
minz
1 2020-01-28 03:31:22.209
2 2020-01-28 03:31:23.209
3 2020-01-28 03:31:24.209
4 2020-01-28 03:31:25.209
5 2020-01-28 03:31:26.209
6 2020-01-28 03:31:27.209
> half_secz <- as.data.frame(minz) %>% pad('0.5 sec')
Error in seq.int(0, to0 - from, by) : invalid '(to - from)/by'
> head(secz)
minz
1 2020-01-28 03:31:22.209
2 2020-01-28 03:31:23.209
3 2020-01-28 03:31:24.209
4 2020-01-28 03:31:25.209
5 2020-01-28 03:31:26.209
6 2020-01-28 03:31:27.209
We can use seq as it is specifying the by argument as 0.5.
start = as.POSIXct("2020-01-28 03:31:22.209", format = "%Y-%m-%d %H:%M:%OS",tz = "UTC")
end = as.POSIXct("2020-01-28 05:31:22.209", format = "%Y-%m-%d %H:%M:%OS",tz = "UTC")
seq(start, end, by = 0.5) %>% head
#[1] "2020-01-28 03:31:22.209 UTC" "2020-01-28 03:31:22.709 UTC"
#[3] "2020-01-28 03:31:23.209 UTC" "2020-01-28 03:31:23.709 UTC"
#[5] "2020-01-28 03:31:24.209 UTC" "2020-01-28 03:31:24.709 UTC"
Or if you want to use it in secz dataframe
tidyr::complete(secz, minz = seq(min(minz), max(minz), by = 0.5))
# minz
# <dttm>
# 1 2020-01-28 03:31:22.209
# 2 2020-01-28 03:31:22.709
# 3 2020-01-28 03:31:23.209
# 4 2020-01-28 03:31:23.709
# 5 2020-01-28 03:31:24.209
# 6 2020-01-28 03:31:24.709
# 7 2020-01-28 03:31:25.209
# 8 2020-01-28 03:31:25.709
# 9 2020-01-28 03:31:26.209
#10 2020-01-28 03:31:26.709
# … with 14,391 more rows
I have a date-time object of class POSIXct. I need to adjust the values by adding several hours. I understand that I can do this using basic addition. For example, I can add 5 hours to a POSIXct object like so:
x <- as.POSIXct("2009-08-02 18:00:00", format="%Y-%m-%d %H:%M:%S")
x
[1] "2009-08-02 18:00:00 PDT"
x + (5*60*60)
[1] "2009-08-02 23:00:00 PDT"
Now I have a data frame in which some times are ok and some are bad.
> df
set_time duration up_time
1 2009-05-31 14:10:00 3 2009-05-31 11:10:00
2 2009-08-02 18:00:00 4 2009-08-02 23:00:00
3 2009-08-03 01:20:00 5 2009-08-03 06:20:00
4 2009-08-03 06:30:00 2 2009-08-03 11:30:00
Note that the first data frame entry has an 'up_time' less than the 'set_time'. So in this context a 'good' time is one where the set_time < up_time. And a 'bad' time is one in which set_time > up_time. I want to leave the good entries alone and fix the bad entries. The bad entries should be fixed by creating an 'up_time' that is equal to the 'set_time' + duration. I do this with the following dplyr pipe:
df1 <- tbl_df(df) %>% mutate(up_time = ifelse(set_time > up_time, set_time +
(duration*60*60), up_time))
df1
# A tibble: 4 x 3
set_time duration up_time
<dttm> <dbl> <dbl>
1 2009-05-31 14:10:00 3. 1243815000.
2 2009-08-02 18:00:00 4. 1249279200.
3 2009-08-03 01:20:00 5. 1249305600.
4 2009-08-03 06:30:00 2. 1249324200.
Up time has been coerced to numeric:
> str(df1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4 obs. of 3 variables:
$ set_time: POSIXct, format: "2009-05-31 14:10:00" "2009-08-02 18:00:00"
"2009-08-03 01:20:00" "2009-08-03 06:30:00"
$ duration: num 3 4 5 2
$ up_time : num 1.24e+09 1.25e+09 1.25e+09 1.25e+09
I can convert it back to the desired POSIXct format using:
> as.POSIXct(df1$up_time,origin="1970-01-01")
[1] "2009-05-31 17:10:00 PDT" "2009-08-02 23:00:00 PDT" "2009-08-03 06:20:00
PDT" "2009-08-03 11:30:00 PDT"
But I feel like this last step shouldn't be necessary. Can I/How can I avoid having dplyr change my variable formatting?
as I failed to solve my problem with PHP/MySQL or Excel due to the data size, I'm trying to do my very first steps with R now and struggle a bit. The problem is this: I have a second-by-second CSV-file with half a year of data, that looks like this:
metering,timestamp
123,2016-01-01 00:00:00
345,2016-01-01 00:00:01
243,2016-01-01 00:00:02
101,2016-01-01 00:00:04
134,2016-01-01 00:00:06
As you see, there are some seconds missing every once in a while (don't ask me, why the values are written before the timestamp, but that's how I received the data…). Now I try to calculate the amount of values (= seconds) that are missing.
So my idea was
to create a vector that is correct (includes all sec-by-sec timestamps),
match the given CSV file with that new vector, and
sum up all the timestamps with no value.
I managed to make step 1 happen with the following code:
RegularTimeSeries <- seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"), as.POSIXct("2016-01-01 00:00:30", tz = "UTC"), by = "1 sec")
write.csv(RegularTimeSeries, file = "RegularTimeSeries.csv")
To have an idea what I did I also exported the vector to a CSV that looks like this:
"1",2016-01-01 00:00:00
"2",2016-01-01 00:00:01
"3",2016-01-01 00:00:02
"4",2016-01-01 00:00:03
"5",2016-01-01 00:00:04
"6",2016-01-01 00:00:05
"7",2016-01-01 00:00:06
Unfortunately I have no idea how to go on with step 2 and 3. I found some very similar examples (http://www.r-bloggers.com/fix-missing-dates-with-r/, R: Insert rows for missing dates/times), but as a total R noob I struggled to translate these examples to my given sec-by-sec data.
Some hints for the greenhorn would be very very helpful – thank you very much in advance :)
In the tidyverse,
library(dplyr)
library(tidyr)
# parse datetimes
df %>% mutate(timestamp = as.POSIXct(timestamp)) %>%
# complete sequence to full sequence from min to max by second
complete(timestamp = seq.POSIXt(min(timestamp), max(timestamp), by = 'sec'))
## # A tibble: 7 x 2
## timestamp metering
## <time> <int>
## 1 2016-01-01 00:00:00 123
## 2 2016-01-01 00:00:01 345
## 3 2016-01-01 00:00:02 243
## 4 2016-01-01 00:00:03 NA
## 5 2016-01-01 00:00:04 101
## 6 2016-01-01 00:00:05 NA
## 7 2016-01-01 00:00:06 134
If you want the number of NAs (i.e. the number of seconds with no data), add on
%>% tally(is.na(metering))
## # A tibble: 1 x 1
## n
## <int>
## 1 2
You can check which values of your RegularTimeSeries are in your broken time series using which and %in%. First create BrokenTimeSeries from your example:
RegularTimeSeries <- seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"), as.POSIXct("2016-01-01 00:00:30", tz = "UTC"), by = "1 sec")
BrokenTimeSeries <- RegularTimeSeries[-c(3,6,9)] # remove some seconds
This will give you the indeces of values within RegularTimeSeries that are not in BrokenTimeSeries:
> which(!(RegularTimeSeries %in% BrokenTimeSeries))
[1] 3 6 9
This will return the actual values:
> RegularTimeSeries[which(!(RegularTimeSeries %in% BrokenTimeSeries))]
[1] "2016-01-01 00:00:02 UTC" "2016-01-01 00:00:05 UTC" "2016-01-01 00:00:08 UTC"
Maybe I'm misunderstanding your problem but you can count the number of missing seconds simply subtracting the length of your broken time series from RegularTimeSeries or getting the length of any of the two resulting vectors above.
> length(RegularTimeSeries) - length(BrokenTimeSeries)
[1] 3
> length(which(!(RegularTimeSeries %in% BrokenTimeSeries)))
[1] 3
> length(RegularTimeSeries[which(!(RegularTimeSeries %in% BrokenTimeSeries))])
[1] 3
If you want to merge the files together to see the missing values you can do something like this:
#data with regular time series and a "step"
df <- data.frame(
RegularTimeSeries
)
df$BrokenTimeSeries[RegularTimeSeries %in% BrokenTimeSeries] <- df$RegularTimeSeries
df$BrokenTimeSeries <- as.POSIXct(df$BrokenTimeSeries, origin="2015-01-01", tz="UTC")
resulting in:
> df[1:12,]
RegularTimeSeries BrokenTimeSeries
1 2016-01-01 00:00:00 2016-01-01 00:00:00
2 2016-01-01 00:00:01 2016-01-01 00:00:01
3 2016-01-01 00:00:02 <NA>
4 2016-01-01 00:00:03 2016-01-01 00:00:02
5 2016-01-01 00:00:04 2016-01-01 00:00:03
6 2016-01-01 00:00:05 <NA>
7 2016-01-01 00:00:06 2016-01-01 00:00:04
8 2016-01-01 00:00:07 2016-01-01 00:00:05
9 2016-01-01 00:00:08 <NA>
10 2016-01-01 00:00:09 2016-01-01 00:00:06
11 2016-01-01 00:00:10 2016-01-01 00:00:07
12 2016-01-01 00:00:11 2016-01-01 00:00:08
If all you want is the number of missing seconds, it can be done much more simply. First find the number of seconds in your timerange, and then subtract the number of rows in your dataset. This could be done in R along these lines:
n.seconds <- difftime("2016-06-01 00:00:00", "2016-01-01 00:00:00", units="secs")
n.rows <- nrow(my.data.frame)
n.missing.values <- n.seconds - n.rows
You might change the time range and the variable of your data frame.
Hope it helps
d <- (c("2016-01-01 00:00:01",
"2016-01-01 00:00:02",
"2016-01-01 00:00:03",
"2016-01-01 00:00:04",
"2016-01-01 00:00:05",
"2016-01-01 00:00:06",
"2016-01-01 00:00:10",
"2016-01-01 00:00:12",
"2016-01-01 00:00:14",
"2016-01-01 00:00:16",
"2016-01-01 00:00:18",
"2016-01-01 00:00:20",
"2016-01-01 00:00:22"))
d <- as.POSIXct(d)
for (i in 2:length(d)){
if(difftime(d[i-1],d[i], units = "secs") < -1 ){
c[i] <- d[i]
}
}
class(c) <- c('POSIXt','POSIXct')
c
[1] NA NA NA
NA NA
[6] NA "2016-01-01 00:00:10 EST" "2016-01-01 00:00:12
EST" "2016-01-01 00:00:14 EST" "2016-01-01 00:00:16 EST"
[11] "2016-01-01 00:00:18 EST" "2016-01-01 00:00:20 EST" "2016-01-01
00:00:22 EST"
I have a POSIXct class vector containing am hours and I want to replace the values in a data frame containing a character class column. When I do the replacement the class changes to character. I'm proceeding as follows:
class(data2014.im.t[,2])
[1] "character"
class(horas.am)
[1] "POSIXct" "POSIXt"
head(horas.am)
[1] "1970-01-01 09:00:00 COT" "1970-01-01 10:00:00 COT" "1970-01-01 11:00:00 COT" "1970-01-01 12:00:00 COT"
[5] "1970-01-01 01:00:00 COT" "1970-01-01 02:00:00 COT"
data2014.im.t[grep("([a])", data2014.im.t[,2]), 2] <- horas.am
class(data2014.im.t[,2])
[1] "character"
head(data2014.im.t[,2])
[1] "50400" "54000" "57600" "104400" "64800" "68400"
Evidently I would like to have a POSIXct column containing hours. Any thoughts?
You should explicitly do the conversion yourself
#sample data
horas.am <- seq(as.POSIXct("2014-01-01 05:00:00"), length.out=10, by="2 hours")
data2014.im.t <- data.frame(a=1:10, b=rep("a",10), stringsAsFactors=FALSE)
class(data2014.im.t[,2])
# [1] "character"
class(horas.am)
# [1] "POSIXct" "POSIXt"
# NO:
data2014.im.t[grep("([a])", data2014.im.t[,2]), 2] <- horas.am
# YES
data2014.im.t[grep("([a])", data2014.im.t[,2]), 2] <- as.character(horas.am)
data2014.im.t
# a b
# 1 1 2014-01-01 05:00:00
# 2 2 2014-01-01 07:00:00
# 3 3 2014-01-01 09:00:00
# 4 4 2014-01-01 11:00:00
# 5 5 2014-01-01 13:00:00
# 6 6 2014-01-01 15:00:00
# 7 7 2014-01-01 17:00:00
# 8 8 2014-01-01 19:00:00
# 9 9 2014-01-01 21:00:00
# 10 10 2014-01-01 23:00:00
class(data2014.im.t[,2])
# [1] "character"