Convert non-standard time data in R - r

I have a dataset with non-standard time data: the Excel file has numbers in a variety of different formats, as shown below.
I'm trying to convert it to something usable in R, probably HH:MM AM/PM, so that I can use mutate(H1B.format = format(strptime(H1B, "%I:%M %p"), "%H:%M")).
How would I do this? I tried separate(H1B, into = c('time', 'ampm'), sep = -2, convert = TRUE) to put AM/PM into a separate column, but I still need to figure out how to add colons and zeros as needed.
I'm also fairly new to R, so any help would be great!

You can use lubridate::parse_date_time to parse times in various formats.
library("lubridate")
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
times_to_parse <- c(
"10:30PM", "9:30pm", "12am", "10pm", "6am", "5:30pm",
"1015PM", "1030pm"
)
time_formats <- c(
"%I:%M %p", "%I %p"
)
lubridate::parse_date_time(
times_to_parse, time_formats
)
#> [1] "0000-01-01 22:30:00 UTC" "0000-01-01 21:30:00 UTC"
#> [3] "0000-01-01 00:00:00 UTC" "0000-01-01 22:00:00 UTC"
#> [5] "0000-01-01 06:00:00 UTC" "0000-01-01 17:30:00 UTC"
#> [7] "0000-01-01 22:15:00 UTC" "0000-01-01 22:30:00 UTC"
Created on 2022-03-15 by the reprex package (v2.0.1)
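If you would rather stay in base R, the same clean-up can be sketched by normalising every string to a fixed HHMM plus AM/PM shape before parsing. This is only an illustration: normalize_time() is a made-up helper name, the sample values are the ones from the answer above, and the code assumes that every value ends in am/pm and that switching the time locale to "C" is acceptable.

```r
# Assumption: use the C locale so that %p matches the English "AM"/"PM" markers.
invisible(Sys.setlocale("LC_TIME", "C"))

# normalize_time() is a hypothetical helper, not part of any package.
normalize_time <- function(x) {
  x <- toupper(gsub("\\s+", "", x))              # "9:30 pm" -> "9:30PM"
  ampm <- substr(x, nchar(x) - 1, nchar(x))      # trailing "AM" or "PM"
  digits <- gsub("[^0-9]", "", x)                # "9:30PM" -> "930"
  # Hour-only values such as "12" get "00" minutes appended.
  digits <- ifelse(nchar(digits) <= 2, paste0(digits, "00"), digits)
  digits <- sprintf("%04d", as.integer(digits))  # "930" -> "0930"
  parsed <- strptime(paste0(digits, ampm), "%I%M%p", tz = "UTC")
  format(parsed, "%H:%M")                        # 24-hour "HH:MM"
}

times_to_parse <- c(
  "10:30PM", "9:30pm", "12am", "10pm", "6am", "5:30pm",
  "1015PM", "1030pm"
)
normalize_time(times_to_parse)
```

In a dplyr pipeline this could be used as mutate(H1B.format = normalize_time(H1B)), with H1B being the column from the question.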

Related

Plot a histogram of yearly counts

I have a csv file that consists of one column. The column presents the date of posting on a website. I want to plot a histogram to see how the number of posts varies over the years. The file contains the years (2012 to 2016) and consists of 11,000 rows.
sample of the file:
2 30/1/12 21:07
3 2/2/12 15:53
4 3/4/12 0:49
5 14/11/12 3:49
6 11/8/13 16:00
7 31/7/14 8:08
8 31/7/14 10:48
9 6/8/14 9:24
10 16/12/14 3:34
The data type is a data frame:
class(postsData)
[1] "data.frame"
I tried converting the data to text using strptime function as below:
formatDate <- strptime(as.character(postsData$Date),format="“%d/%m/%y")
then plot the histogram
hist(formatDate,breaks=10,xlab="year")
Any tip or suggestion would be useful. Thank you,
Use lubridate::dmy_hm().
strptime() is overly complicated in my opinion compared to { lubridate }.
library(lubridate)
d <- c("30/1/12 21:07",
"2/2/12 15:53",
"3/4/12 0:49",
"14/11/12 3:49",
"11/8/13 16:00",
"31/7/14 8:08",
"31/7/14 10:48",
"6/8/14 9:24",
"16/12/14 3:34")
d2 <- dmy_hm(d)
d2
Returns:
[1] "2012-01-30 21:07:00 UTC"
[2] "2012-02-02 15:53:00 UTC"
[3] "2012-04-03 00:49:00 UTC"
[4] "2012-11-14 03:49:00 UTC"
[5] "2013-08-11 16:00:00 UTC"
[6] "2014-07-31 08:08:00 UTC"
[7] "2014-07-31 10:48:00 UTC"
[8] "2014-08-06 09:24:00 UTC"
[9] "2014-12-16 03:34:00 UTC"
As you can see, lubridate functions return POSIXct objects.
class(d2)
[1] "POSIXct" "POSIXt"
Next, you can use lubridate::year() to get the year of each POSIXct object returned by dmy_hm(), and plot that histogram.
hist(year(d2))
Here's one approach. I think your date conversion is fine but you need to count the number of dates that occur in each year then plot that count as a histogram.
library(tidyverse)
# generate some data
date.seq <- tibble(xdate = seq(from = lubridate::ymd_hms('2000-01-01 00:00:00'), to=lubridate::ymd_hms('2016-12-31 23:59:59'), length.out = 100))
date.seq %>%
mutate(xyear = lubridate::year(xdate)) %>% # add a column of years
group_by(xyear) %>%
summarise(date_count = length(xdate)) %>% # Count the number of dates that occur in each year
ggplot(aes(x = xyear, y = date_count)) +
geom_col(colour = 'black', fill = 'blue') # plot as a column graph
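The count-then-plot idea can also be sketched in base R without any packages. The sample values below are the ones from the question; the 2012:2016 range comes from the question's description of the file, and the variable names are only illustrative.

```r
d <- c("30/1/12 21:07", "2/2/12 15:53", "3/4/12 0:49", "14/11/12 3:49",
       "11/8/13 16:00", "31/7/14 8:08", "31/7/14 10:48", "6/8/14 9:24",
       "16/12/14 3:34")

# Parse day/month/two-digit-year plus time, then keep only the year.
parsed <- as.POSIXct(d, format = "%d/%m/%y %H:%M", tz = "UTC")
years <- as.integer(format(parsed, "%Y"))

# Fixing the levels keeps years without posts in the plot with a count of 0.
counts <- table(factor(years, levels = 2012:2016))
barplot(counts, xlab = "year", ylab = "number of posts")
```

Because years are discrete, a bar chart of the counts avoids having to choose histogram breaks.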
There's no problem with strptime()*; however, the format option is intended to specify how the input is formatted.
df1$date <- strptime(df1$date, format="%d/%m/%y %H:%M")
# [1] "2012-01-30 21:07:00 CET" "2012-02-02 15:53:00 CET"
# [3] "2012-04-03 00:49:00 CEST" "2012-11-14 03:49:00 CET"
# [5] "2013-08-11 16:00:00 CEST" "2014-07-31 08:08:00 CEST"
# [7] "2014-07-31 10:48:00 CEST" "2014-08-06 09:24:00 CEST"
# [9] "2014-12-16 03:34:00 CET"
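What you probably want then is to drop the time part. The format() function does that, but it returns character strings, which hist() cannot plot:
formatDate <- format(df1$date, format="%F")
so in this case it is simpler (and plottable) to use formatDate <- as.Date(df1$date)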
and then
hist(formatDate, breaks=10, xlab="year")
* credits to @MikkoMarttila
Data
df1 <- structure(list(id = 2:10, date = c("30/1/12 21:07", "2/2/12 15:53",
"3/4/12 0:49", "14/11/12 3:49", "11/8/13 16:00", "31/7/14 8:08",
"31/7/14 10:48", "6/8/14 9:24", "16/12/14 3:34")), class = "data.frame", row.names = c(NA,
-9L))

parse_date_time() converting DayofYear in date

Hi, I'm using the lubridate package and
I want to convert a vector from 1:365 (day of year) into a date format:
e.g. 60 -> 2019-03-01 UTC.
For 1-99 it works fine, but for 100-365 I get a warning message.
lubridate::parse_date_time(99, "j")
[1] "2019-04-09 UTC"
lubridate::parse_date_time(100:365, "j")
[1] NA ...
[365] NA
Warning message:
All formats failed to parse. No formats found.
Does anyone get the same warning message, or have a solution?
If you provide character input, it works well
lubridate::parse_date_time('100', "j")
# [1] "2019-04-10 UTC"
lubridate::parse_date_time(paste(100:365), "j")
# [1] "2019-04-10 UTC" "2019-04-11 UTC" "2019-04-12 UTC" "2019-04-13 UTC" "2019-04-14 UTC" "2019-04-15 UTC" "2019-04-16 UTC" "2019-04-17 UTC"
# ...
# [265] "2019-12-30 UTC" "2019-12-31 UTC"
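You can also do it in base R by specifying an origin date. For numeric input the format argument is ignored, the origin must be written as year-month-day, and you need to subtract 1 because day 1 is the origin itself:
as.Date(100:365 - 1, origin = "2019-01-01")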
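A small base-R sketch of the day-of-year conversion, checked against the values mentioned in the question. day_of_year_to_date() is a hypothetical helper name, and the year 2019 is taken from the outputs shown above.

```r
# day_of_year_to_date() is a hypothetical helper used only for this sketch.
day_of_year_to_date <- function(day, year) {
  # Day 1 is 1 January itself, hence the "- 1".
  as.Date(day - 1, origin = paste0(year, "-01-01"))
}

dates_2019 <- day_of_year_to_date(1:365, 2019)
format(dates_2019[c(1, 60, 100, 365)])
```

Passing the year explicitly also makes leap years work, since the offset is simply added to 1 January of that year.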

Decompose xts hourly time series

I want to decompose hourly time series with decompose, ets, or stl or whatever function. Here is an example code and its output:
require(xts)
require(forecast)
time_index1 <- seq(from = as.POSIXct("2012-05-15 07:00"),
to = as.POSIXct("2012-05-17 18:00"), by="hour")
head(time_index1 <- format(time_index1, format="%Y-%m-%d %H:%M:%S",
tz="UTC", usetz=TRUE))
# [1] "2012-05-15 05:00:00 UTC" "2012-05-15 06:00:00 UTC"
# [3] "2012-05-15 07:00:00 UTC" "2012-05-15 08:00:00 UTC"
# [5] "2012-05-15 09:00:00 UTC" "2012-05-15 10:00:00 UTC"
head(time_index <- as.POSIXct(time_index1))
# [1] "2012-05-15 05:00:00 CEST" "2012-05-15 06:00:00 CEST"
# [3] "2012-05-15 07:00:00 CEST" "2012-05-15 08:00:00 CEST"
# [5] "2012-05-15 09:00:00 CEST" "2012-05-15 10:00:00 CEST"
Why does the timezone for time_index change back to CEST?
set.seed(1)
value <- rnorm(n = length(time_index1))
eventdata1 <- xts(value, order.by = time_index)
tzone(eventdata1)
# [1] ""
head(index(eventdata1))
# [1] "2012-05-15 05:00:00 CEST" "2012-05-15 06:00:00 CEST"
# [3] "2012-05-15 07:00:00 CEST" "2012-05-15 08:00:00 CEST"
# [5] "2012-05-15 09:00:00 CEST" "2012-05-15 10:00:00 CEST"
ets(eventdata1)
# ETS(A,N,N)
#
# Call:
# ets(y = eventdata1)
#
# Smoothing parameters:
# alpha = 1e-04
#
# Initial states:
# l = 0.1077
#
# sigma: 0.8481
#
# AIC AICc BIC
# 229.8835 230.0940 234.0722
decompose(eventdata1)
# Error in decompose(eventdata1) :
# time series has no or less than 2 periods
stl(eventdata1)
# Error in stl(eventdata1) :
# series is not periodic or has less than two periods
When I call tzone or indexTZ there is no timezone, but the index clearly shows that the times are defined with a timezone.
Also, why does only ets work? Can it be used to decompose a time series?
Why does the timezone for time_index change back to CEST?
Because you didn't specify tz= in your call to as.POSIXct. It will only pick up the timezone from the string if it's specified by offset from UTC (e.g. -0800). See ?strptime.
R> head(time_index <- as.POSIXct(time_index1, "UTC"))
[1] "2012-05-15 12:00:00 UTC" "2012-05-15 13:00:00 UTC"
[3] "2012-05-15 14:00:00 UTC" "2012-05-15 15:00:00 UTC"
[5] "2012-05-15 16:00:00 UTC" "2012-05-15 17:00:00 UTC"
When I call tzone or indexTZ there is no timezone, but the index clearly shows that the times are defined with a timezone.
All POSIXct objects have a timezone. A timezone of "" simply means R wasn't able to determine a specific timezone, so it is using the timezone specified by your operating system. See ?timezone.
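The difference between passing tz= and relying on the default can be sketched in base R as follows. The timestamp is one of the strings from the question, and the variable names are only illustrative.

```r
stamp <- "2012-05-15 05:00:00 UTC"

# Without tz=, the trailing " UTC" is ignored and the clock time is
# interpreted in the system timezone.
local_guess <- as.POSIXct(stamp)

# With tz=, the clock time is interpreted as UTC.
utc_time <- as.POSIXct(stamp, tz = "UTC")
attr(utc_time, "tzone")
format(utc_time, "%Y-%m-%d %H:%M %Z")

# The same instant shown on a Singapore clock (UTC+8).
format(utc_time, "%H:%M", tz = "Asia/Singapore")
```

What local_guess prints depends on the machine it runs on, which is why the question's output showed CEST.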
Only the ets function works because your xts object doesn't have a properly defined frequency attribute. This is a known limitation of xts objects, and I have plans to address them over the next several months. You can work around the current issues by explicitly specifying the frequency attribute after calling the xts constructor.
R> set.seed(1)
R> value <- rnorm(n = length(time_index1))
R> eventdata1 <- xts(value, order.by = time_index)
R> attr(eventdata1, 'frequency') <- 24 # set frequency attribute
R> decompose(as.ts(eventdata1)) # decompose expects a 'ts' object
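To see why the frequency matters, here is a minimal sketch that uses a plain ts object instead of xts (an assumption made to keep the example free of packages). The 60 observations match the length of the hourly index in the question, and a frequency of 24 treats one day as one seasonal period.

```r
set.seed(1)
value <- rnorm(60)                    # 60 hourly observations

# 24 observations per period: one period is one day of hourly data.
hourly <- ts(value, frequency = 24)

# With a frequency set and at least two full periods, both calls run.
parts <- decompose(hourly)
stl_fit <- stl(hourly, s.window = "periodic")

parts$figure                          # the 24 estimated hourly effects
```

Note that stl() also needs an s.window argument, which the call in the question did not supply.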
You can use tbats to decompose hourly data:
require(forecast)
set.seed(1)
time_index1 <- seq(from = as.POSIXct("2012-05-15 07:00"),
to = as.POSIXct("2012-05-17 18:00"), by="hour")
value <- rnorm(n = length(time_index1))
eventdata1 <- msts(value, seasonal.periods = c(24) )
seasonaldecomp <- tbats(eventdata1)
plot(seasonaldecomp)
Additionally, using msts instead of xts allows you to specify multiple seasonal periods, for instance daily as well as weekly cycles in hourly data: c(24, 24*7)

Best way to deal with differing date data [duplicate]

I am trying to do a simple operation in R. After loading a table, I encountered a date column which has several formats combined.
**Date**
1/28/14 6:43 PM
1/29/14 4:10 PM
1/30/14 12:09 PM
1/30/14 12:12 PM
02-03-14 19:49
02-03-14 20:03
02-05-14 14:33
I need to convert this to a format like 28-01-2014 18:43, i.e. %d-%m-%y %h:%m
I tried this
tablename$Date <- as.Date(as.character(tablename$Date), "%d-%m-%y %h:%m")
but this fills the entire column with NA. Please help me to get this right!
The lubridate package makes quick work of this:
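library(lubridate)
# example input taken from the question
dates <- c("1/28/14 6:43 PM", "1/29/14 4:10 PM", "1/30/14 12:09 PM", "1/30/14 12:12 PM",
"02-03-14 19:49", "02-03-14 20:03", "02-05-14 14:33")
d <- parse_date_time(dates, names(guess_formats(dates, c("mdy HM", "mdy IMp"))))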
d
## [1] "2014-01-28 18:43:00 UTC" "2014-01-29 16:10:00 UTC"
## [3] "2014-01-30 12:09:00 UTC" "2014-01-30 12:12:00 UTC"
## [5] "2014-02-03 19:49:00 UTC" "2014-02-03 20:03:00 UTC"
## [7] "2014-02-05 14:33:00 UTC"
# put in desired format
format(d, "%m-%d-%Y %H:%M:%S")
## [1] "01-28-2014 18:43:00" "01-29-2014 16:10:00" "01-30-2014 12:09:00"
## [4] "01-30-2014 12:12:00" "02-03-2014 19:49:00" "02-03-2014 20:03:00"
## [7] "02-05-2014 14:33:00"
You'll need to adjust the vector in guess_formats if you come across other format variations.
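The same "try each format in turn" idea can be sketched in base R. The two format strings below are assumptions read off the sample rows in the question (month/day/year with AM/PM, and month-day-year with a 24-hour clock), and the locale is switched to "C" so that %p matches the English AM/PM markers.

```r
# Assumption: use the C locale so that %p matches "AM"/"PM".
invisible(Sys.setlocale("LC_TIME", "C"))

dates <- c("1/28/14 6:43 PM", "1/30/14 12:09 PM", "02-03-14 19:49")
formats <- c("%m/%d/%y %I:%M %p", "%m-%d-%y %H:%M")

# Parse with the first format, then retry only the failures with the next one.
parsed <- as.POSIXct(strptime(dates, formats[1], tz = "UTC"))
for (fmt in formats[-1]) {
  failed <- is.na(parsed)
  parsed[failed] <- as.POSIXct(strptime(dates[failed], fmt, tz = "UTC"))
}

# Day-month-year output, as requested in the question.
format(parsed, "%d-%m-%Y %H:%M")
```

Any value that matches none of the formats stays NA, which makes it easy to spot variations that still need a format of their own.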

Changing dates in different time zones by adding to POSIXlt

I am running into an error when I try to localize times for "date" (a variable of class=POSIXlt) in my dataset. Example code is as follows:
# All dates are coded by survey software in EST(not local time)
date <- c("2011-07-26 07:23", "2011-07-29 07:34", "2011-07-29 07:40")
region <-c("USA-EST", "UK", "Singapore")
#Change the times based on time-zone differences
start_time<-strptime(date,"%Y-%m-%d %h:%m")
localtime=as.POSIXlt(start_time)
localtime<-ifelse(region=="UK",start_time+6,start_time)
localtime<-ifelse(region=="Singapore",start_time+12,start_time)
#Then, I need to extract the hour and weekday
weekday<-weekdays(localtime)
hour<-factor(localtime)
There must be something wrong with my "ifelse" statement, because I get the error: number of items to replace is not a multiple of replacement length. Please help!
How about using R's native time code? The trick is that you can't have more than one time-zone in a POSIX vector, so use a list instead:
region <- c("EST","Europe/London","Asia/Singapore")
(localtime <- lapply(seq(date),function(x) as.POSIXlt(date[x],tz=region[x])))
[[1]]
[1] "2011-07-26 07:23:00 EST"
[[2]]
[1] "2011-07-29 07:34:00 Europe/London"
[[3]]
[1] "2011-07-29 07:40:00 Asia/Singapore"
And to convert to a vector in a single timezone:
Reduce("c",localtime)
[1] "2011-07-26 13:23:00 BST" "2011-07-29 07:34:00 BST"
[3] "2011-07-29 00:40:00 BST"
Note that my system timezone is BST, but if yours is EST it will convert to that.
You can use the timezone handling built into POSIXct:
> start_time <- as.POSIXct(date, format = "%Y-%m-%d %H:%M", tz = "America/New_York")
> start_time
[1] "2011-07-26 07:23:00 EDT" "2011-07-29 07:34:00 EDT" "2011-07-29 07:40:00 EDT"
> format(start_time, tz="Europe/London", usetz=TRUE)
[1] "2011-07-26 12:23:00 BST" "2011-07-29 12:34:00 BST" "2011-07-29 12:40:00 BST"
> format(start_time, tz="Asia/Singapore", usetz=TRUE)
[1] "2011-07-26 19:23:00 SGT" "2011-07-29 19:34:00 SGT" "2011-07-29 19:40:00 SGT"
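To get the local hour and weekday for each row, one option is to format every timestamp in its own region's timezone. This is a sketch under assumptions: tz_by_row is a made-up lookup that maps the question's three regions to Olson timezone names, and the weekday is returned as a number (1 = Monday) to keep the result independent of the locale.

```r
date <- c("2011-07-26 07:23", "2011-07-29 07:34", "2011-07-29 07:40")
# Hypothetical mapping of "USA-EST", "UK" and "Singapore" to timezone names.
tz_by_row <- c("America/New_York", "Europe/London", "Asia/Singapore")

# All timestamps were recorded on a US Eastern clock.
start_time <- as.POSIXct(date, format = "%Y-%m-%d %H:%M", tz = "America/New_York")

# Format each instant in the timezone of its own row.
local_part <- function(spec) {
  vapply(seq_along(start_time),
         function(i) format(start_time[i], spec, tz = tz_by_row[i]),
         character(1))
}

local_hour <- as.integer(local_part("%H"))     # hour of day on the local clock
local_weekday <- as.integer(local_part("%u"))  # 1 = Monday ... 7 = Sunday
```

Working with character output per row sidesteps the restriction that a single POSIXct vector carries only one timezone.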

Resources