Convert heterogeneous date character vectors to date format (R)

I'm trying to convert a character column with dates to date format. However, the dates are in an ambiguous format. Some entries are of the format %d.%m.%Y (e.g., "03.02.2021"), while others are %d %b %Y (e.g., "3 Feb 2021").
I've tried as.Date(tryFormats=c("%d %b %Y", "%d.%m.%Y")), but realized that tryFormats is only flexible for the first entry, so that the entries of type %d %b %Y are correctly identified but those of %d.%m.%Y become NAs, or vice versa. I've also tried the anytime package, but that produced NAs in a similar fashion.
I've made sure that the column doesn't contain any NAs or empty strings, and I don't receive any error message.
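A minimal sketch of the behaviour described above, using the two sample strings from the question: as.Date() picks one of the tryFormats based on the first non-NA element and then applies that single format to the whole vector, so the other style silently becomes NA.

```r
x <- c("03.02.2021", "3 Feb 2021")

# "%d %b %Y" fails on the first element, so "%d.%m.%Y" is chosen
# and then used for every element of x.
res <- as.Date(x, tryFormats = c("%d %b %Y", "%d.%m.%Y"))

is.na(res)
# [1] FALSE  TRUE
```

Reordering tryFormats does not help, because the choice is still driven by the first element only.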

Try the parsedate package:
library(dplyr)
df <- read.table(header=TRUE, text=
"d
03.02.2021
'3 Feb 2021'
13/3/2021
13-3-2020")
df %>% mutate(date=parsedate::parse_date(d))
## d date
##1 03.02.2021 2021-02-03
##2 3 Feb 2021 2021-02-03
##3 13/3/2021 2021-03-13
##4 13-3-2020 2020-03-13

Similar (but expanded) to Roland's suggestion, my answer here (in the (2) section) suggests a way to deal with multiple candidate formats.
## sample data
x <- c("03.02.2021", "3 Feb 2021")
formats <- c("%d.%m.%Y", "%d %b %Y")
dates <- as.Date(rep(NA, length(x)))
for (fmt in formats) {
  nas <- is.na(dates)
  dates[nas] <- as.Date(x[nas], format=fmt)
}
dates
# [1] "2021-02-03" "2021-02-03"
It is better to have the most-frequent format first in the formats vector. One could add a quick-escape to the loop if there are many formats, such as
for (fmt in formats) {
  nas <- is.na(dates)
  if (!any(nas)) break
  dates[nas] <- as.Date(x[nas], format=fmt)
}
but I suspect that it really won't be very beneficial unless both formats and x are rather large (I have no sizes in mind to quantify "large").
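If this is needed in more than one place, the loop can be wrapped in a small helper. This is only a sketch: the function name parse_mixed_dates is made up for illustration, and anything that matches none of the formats is left as NA so it can be inspected afterwards.

```r
# Hypothetical helper: try each format in turn on the still-unparsed entries.
parse_mixed_dates <- function(x, formats) {
  dates <- as.Date(rep(NA, length(x)))
  for (fmt in formats) {
    nas <- is.na(dates)
    if (!any(nas)) break          # everything parsed, stop early
    dates[nas] <- as.Date(x[nas], format = fmt)
  }
  dates
}

x <- c("03.02.2021", "3 Feb 2021", "not a date")
res <- parse_mixed_dates(x, c("%d.%m.%Y", "%d %b %Y"))

# Entries that no format could parse
x[is.na(res)]
```

Note that %b matches month names in the current locale, so "3 Feb 2021" only parses when LC_TIME is an English (or C) locale.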

Did you try lubridate?
library(lubridate)
df <- read.table(header=TRUE, text=
"d
03.02.2021
'3 Feb 2021'
13/3/2021
13-3-2020")
dmy(df$d)
[1] "2021-02-03" "2021-02-03" "2021-03-13" "2020-03-13"

Using anydate
library(anytime)
addFormats(c("%d/%m/%Y", "%d-%m-%Y"))
anydate(df$d)
[1] "2021-02-03" "2021-02-03" "2021-03-13" "2020-03-13"

Related

Timestamp conversion in R and calculating Time Difference between 2 Columns of different DFs

I need to calculate the time difference (in minutes, hours, days, etc.) between two date-time columns of two data frames. The details are below:
df1 <- data.frame (Name = c("Aks","Bob","Caty","David"),
timestamp = c("Mon Apr 1 14:23:09 1980", "Sun Jun 12 12:10:21 1975", "Fri Jan 5 18:45:10 1985", "Thu Feb 19 02:26:19 1990"))
df2 <- data.frame (Name = c("Aks","Bob","Caty","David"),
timestamp = c("Apr-01-1980 14:28:00","Jun-12-1975 12:45:10","Jan-05-1985 17:50:30","Feb-19-1990 02:28:00"))
I am having trouble converting df1$timestamp and df2$timestamp: as.POSIXct and as.Date do not work, and I get the error "non-numeric argument to binary operator".
I need the time difference in minutes, hours, or days.
One approach is strptime and indicate the appropriate directives in the datetime format:
df1$timestamp2 <- strptime(df1$timestamp, "%a %b %d %H:%M:%S %Y")
df2$timestamp2 <- strptime(df2$timestamp, "%b-%d-%Y %H:%M:%S")
In this case, you have:
%a abbreviated weekday name
%b abbreviated month name
%d day of the month
%H hour, 24-hour clock
%M minute
%S second
%Y year including century
Then you can use difftime to get the difference, and specify the units (in this case, difference expressed in hours):
difftime(df1$timestamp2, df2$timestamp2, units = "hours")
Output
Time differences in hours
[1] -0.08083333 -0.58027778 0.91111111 -0.02805556
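A small sketch of how the units can be changed after the fact, using the first pair of timestamps from the question rewritten in ISO form (an assumption made here so that the example does not depend on locale-specific month names):

```r
t1 <- as.POSIXct("1980-04-01 14:23:09", tz = "UTC")
t2 <- as.POSIXct("1980-04-01 14:28:00", tz = "UTC")

d <- difftime(t2, t1, units = "mins")

mins  <- as.numeric(d)                   # plain number, in the stored units
hours <- as.numeric(d, units = "hours")  # same difference, converted to hours
mins
# [1] 4.85
```

Converting with as.numeric() is useful when the difference has to go into further arithmetic or a data frame column without the difftime class.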
If your locale settings prevent the timestamps from being parsed correctly, try:
# Store current locale
orig_locale <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")
# Convert to posix-timestamp
df1$timestamp <- as.POSIXct( df1$timestamp, format = "%a %b %d %H:%M:%S %Y")
df2$timestamp <- as.POSIXct( df2$timestamp, format = "%b-%d-%Y %H:%M:%S")
# Restore locale
Sys.setlocale("LC_TIME", orig_locale)
# Calculate difference
df2$timestamp - df1$timestamp
# Time differences in mins
# [1] 4.850000 34.816667 -54.666667 1.683333
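The save/set/restore steps can also be put into a function so that the locale is restored even if parsing fails. This is a sketch only; parse_in_c_locale is a made-up name, and tz = "UTC" is an assumption, so substitute the time zone your data was recorded in.

```r
# Hypothetical helper: parse timestamps with English month/day names
# regardless of the session's LC_TIME setting.
parse_in_c_locale <- function(x, format, tz = "UTC") {
  old <- Sys.getlocale("LC_TIME")
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)  # always restore
  Sys.setlocale("LC_TIME", "C")
  as.POSIXct(x, format = format, tz = tz)
}

ts <- parse_in_c_locale("Apr-01-1980 14:28:00", "%b-%d-%Y %H:%M:%S")
format(ts, "%Y-%m-%d %H:%M:%S")
# [1] "1980-04-01 14:28:00"
```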

as.Date returning NA while converting it from character

I am converting following format to date from character
January 2016
I want to convert it to following format
201601
I am using following code
df$date <- as.Date(df$date,"%B %Y")
But it returns me NA values. I have even set the locale as follows
lct<- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME",lct)
But it still gives me NA values. How can I fix this?
We can do this easily with as.yearmon and format
library(zoo)
format(as.yearmon(str1), "%Y%m")
#[1] "201601"
If we go the as.Date route instead, note that a 'Date' also requires a day. So paste a day onto the string, convert it to 'Date', and then use format:
format(as.Date(paste(str1, '01'), "%B %Y %d") , "%Y%m")
data
str1 <- "January 2016"
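If the locale is the culprit (%B only matches month names in the current LC_TIME locale), a different technique is to avoid date parsing altogether and look the month up in base R's built-in month.name constant, which is always English. A sketch, assuming every string has the form "<English month name> <year>":

```r
x <- c("January 2016", "March 2017")

parts <- strsplit(x, " ", fixed = TRUE)
mon <- match(vapply(parts, `[`, character(1), 1), month.name)  # 1..12, NA if unknown
yr  <- vapply(parts, `[`, character(1), 2)

yyyymm <- sprintf("%s%02d", yr, mon)
yyyymm
# [1] "201601" "201703"
```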

Dealing with date-time string that has day of the week

I have a date-time string that has day of the week and some meta-data in the string.
d <- "Fri, 14 Jul 2000 06:59:00 -0700 (PDT)"
I need to convert it into a date-time object (e.g. I have a column of these in a data.table) for further analysis. I have dealt with this using regexes to strip off meta-data from the string. Is there a better approach?
What I have is:
m <- regexpr("^\\w+,\\s+", d, perl=TRUE)
regmatches(d, m)
m <- regexpr("\\s-?\\d+\\s\\(\\w+\\)$", d, perl=TRUE)
regmatches(d, m)
ds <- sub("^\\w+,\\s+", "", d)
ds <- sub("\\s-?\\d+\\s\\(\\w+\\)$", "", ds)
Now I can convert this to date-time objects of class Date, POSIXlt or POSIXct for use in analysis.
dd <- strptime(ds, format="%d %b %Y %H:%M:%S")
dd <- as.Date(ds, format="%d %b %Y %H:%M:%S")
dd <- as.POSIXct(ds, format="%d %b %Y %H:%M:%S")
I wrote the anytime package to help with (among other things) these silly format strings -- so it heuristically just tries a number of them (and focuses on sane ones).
The input you have here qualifies (and is in fact a pretty common form):
R> anytime("Fri, 14 Jul 2000 06:59:00 -0700 (PDT)")
[1] "2000-07-14 06:59:00 CDT"
R>
We do not currently try to capture the timezone offset information at the end, so you have to deal with that after the fact. The display is in CDT which is my local timezone.
There is some more information about anytime on its webpage.
Assuming the format of the string is constant across your data:
time = trimws(unlist(strsplit(d, "[,-]"))[2])
#[1] "14 Jul 2000 06:59:00"
tz = unlist(strsplit(d, "[,-]"))[3]
tz = gsub("[^A-Z]", "", tz)
#[1] "PDT"
> as.Date(time, format = "%d %b %Y")
[1] "2000-07-14"
> as.POSIXct(time, format = "%d %b %Y %H:%M:%S") # specify the timezone with tz
[1] "2000-07-14 06:59:00 IST"
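As a base R alternative to stripping pieces off the string, the weekday and the numeric UTC offset can be named in the format itself; trailing characters such as " (PDT)" are ignored by the parser. A sketch, assuming an English LC_TIME locale (for %a and %b) and that the -0700 offset is the authoritative time zone information:

```r
d <- "Fri, 14 Jul 2000 06:59:00 -0700 (PDT)"

# %z consumes the "-0700" offset, so the result is the correct instant in UTC
utc <- as.POSIXct(d, format = "%a, %d %b %Y %H:%M:%S %z", tz = "UTC")

format(utc, "%Y-%m-%d %H:%M:%S", tz = "UTC")
# [1] "2000-07-14 13:59:00"
```

The same object can be displayed in another zone by passing a different tz to format().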

Sequential numbering for each month on a period of time in R

I have set of dates for a period of 10 years starting April 2006 till August 2016 i.e. 125 months. I want to identify each month by marking them out by sequential numbering starting from "1" till "125" in corresponding column (new column).
Example:
All dates in Apr'2006 will be identified as 1...May'2006 as 2 ...... Aug'2016 as 125.
Dates in the data set is in format type.
Requesting guidance on how to achieve this.
Assume that you start with a vector of dates in factor format:
x<- as.factor(c("8/7/2006", "12/13/2006", "12/14/2006"))
First you should convert this vector to Date format. In your case this can be done like this
x<- as.Date(x, format= "%m/%d/%Y")
Using the format command you can delete the day of a specific date:
format(x, "%Y %m")
> "2006 08" "2006 12" "2006 12"
This way you get rid of the day and just keep year and month.
Next you define a reference vector which contains all months from April 2006 to August 2016:
# by = "month" steps through calendar months; from/to with only length.out
# would space the dates evenly in days and not land on month boundaries
ref <- seq(from = as.Date("04/01/2006", format = "%m/%d/%Y"), by = "month", length.out = 125)
ref <- format(ref, "%Y %m")
Finally you compare the entries from x with the entries from ref. This can be done with the sapply function, which applies a function to each component of x. Here, the function it applies is:
myfun <- function(z) {
  which(ref == format(z, "%Y %m"))
}
But since you do not need the function myfun elsewhere, you can plug it directly into the sapply function. Each date matches exactly one entry of ref, so sapply returns a plain vector.
sapply(x, function(z) which(ref == format(z, "%Y %m")))
> 5 9 9
should do the trick.
Using lubridate to format the dates:
library(lubridate)
# Create a data frame from the string below, as a factor variable
dat <- '8/7/2006 12/13/2006 12/14/2006 12/15/2006 12/16/2006 8/28/2007 8/29/2007 4/22/2008 4/23/2008 4/24/2008 4/25/2008 4/28/2008 4/29/2008 4/30/2008 5/1/2008 5/2/2008 5/7/2016 5/7/2016 5/7/2016 5/7/2016 6/26/2016 7/4/2016 7/31/2016 8/28/2016'
test_df <- data.frame(original=as.factor(strsplit(dat, ' ')[[1]]))
# We will need to convert the dates to strings in the right format
test_df$converted_string <- as.character(floor_date(mdy(test_df$original), unit="month"))
# Create a lookup table
my_months <- seq(125)
names(my_months) <- seq(as.Date('2006-04-01'), by='month', length.out=125)
# Do the lookup
test_df$converted_int <- my_months[test_df$converted_string]
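Since the months are consecutive, the index can also be computed arithmetically without a lookup table. A minimal sketch in base R, assuming April 2006 is month 1 as in the question:

```r
x <- as.Date(c("8/7/2006", "12/13/2006", "8/28/2016"), format = "%m/%d/%Y")

yr <- as.integer(format(x, "%Y"))
mo <- as.integer(format(x, "%m"))

# Whole years since 2006 contribute 12 months each; April (month 4) maps to 1
month_index <- (yr - 2006L) * 12L + (mo - 4L) + 1L
month_index
# [1]   5   9 125
```

Dates before April 2006 would give zero or negative values, which makes out-of-range input easy to spot.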

Convert 12 hour to 24 hour format in R

I am not getting the right conversion when I try to convert 12-hour to 24-hour time. My script (with sample data) is below.
library(lubridate)
library(data.table)
# Read Sample Data
df <- data.frame(c("April 22 2016 10:49:15 AM","April 22 2016 10:01:21 AM","April 22 2016 09:06:40 AM","April 21 2016 09:50:49 PM","April 21 2016 06:07:18 PM"))
colnames(df) <- c("Date") # Set Column name
dt <- setDT(df) # Convert to data.table
ff <- function(x) as.POSIXlt(strptime(x,"%B %d %Y %H:%M:%S %p"))
dt[,dates := as.Date(ff(Date))]
When I try creating a new variable called TOD, I get the output in H:M:S format without it being converted to 24-hour format. For the 4th row, for example, I get 09:50:49 instead of 21:50:49. I tried two different ways to do this: one using as.ITime from data.table and the other using strptime. The code I use to calculate TOD is below.
dt[,TOD1 := as.ITime(ff(Date))]
dt$TOD2 <- format(strptime(dt$Date, "%B %d %Y %H:%M:%S %p"), format="%I:%M:%S")
I thought of trying it using dataframe instead of data.table to eliminate any issues with using strptime in data.table and still got the same answer.
df$TOD <- format(strptime(df$Date, "%B %d %Y %H:%M:%S %p"), format="%I:%M:%S") # Using dataframe instead of data.table
Any insights on how to get the right answer?
As @lmo commented, you need to use the %I directive instead of %H. From ?strptime:
%H
Hours as decimal number (00–23). As a special exception strings
such as 24:00:00 are accepted for input, since ISO 8601 allows these.
%I
Hours as decimal number (01–12).
strptime("April 21 2016 09:50:49 PM", "%B %d %Y %I:%M:%S %p")
# [1] "2016-04-21 21:50:49 EDT"
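To get the time of day as a 24-hour string, parse with %I and %p and then format with %H. A short sketch; the dates are written numerically here and tz = "UTC" is fixed so that only the AM/PM handling is being illustrated (both are assumptions, not part of the original data):

```r
x <- c("2016-04-22 09:06:40 AM", "2016-04-21 09:50:49 PM", "2016-04-21 12:05:00 AM")

# %I reads the 12-hour clock value, %p decides AM or PM
parsed <- as.POSIXct(x, format = "%Y-%m-%d %I:%M:%S %p", tz = "UTC")

tod <- format(parsed, "%H:%M:%S")  # %H writes the 24-hour clock value
tod
# [1] "09:06:40" "21:50:49" "00:05:00"
```

With %H in the input format the hour is read literally as 09, which is why the PM rows in the question came out unchanged.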
Here you go:
library(lubridate)
df$Date <- mdy_hms(df$Date)
Note that while mdy_hms is extremely convenient and takes care of the 12 / 24 hour time for you, it will automatically assign UTC as a time zone. You can specify a different one if you need. You can then convert df to a data.table if you like.
